KenLM Model Documentation

Introduction

KenLM models are probabilistic n-gram language models suited to fast perplexity estimation, which makes them useful for filtering or sampling large datasets. For example, a model trained on a reference corpus such as Wikipedia assigns high perplexity to text that looks unlike that corpus, so those samples can be filtered out.
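
As an illustration, here is a minimal filtering sketch using the KenlmModel helper described under "Guide: Running Locally" below; the threshold is an arbitrary assumption, not a recommended value:

  from model import KenlmModel

  # Load a model trained on English Wikipedia
  model = KenlmModel.from_pretrained("wikipedia", "en")

  # Hypothetical cutoff: tune it on a held-out sample of your data
  PERPLEXITY_THRESHOLD = 1000.0

  texts = ["The cat sat on the mat.", "asdf qwerty zxcv"]
  # Keep only texts that resemble the reference corpus
  kept = [t for t in texts if model.get_perplexity(t) < PERPLEXITY_THRESHOLD]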

Architecture

The repository contains several KenLM models trained on tokenized datasets across various languages. Each language-specific model comprises three files:

  • {language}.arpa.bin: The trained KenLM model binary.
  • {language}.sp.model: The SentencePiece model for tokenization.
  • {language}.sp.vocab: The vocabulary file for the SentencePiece model.

These models are trained on text preprocessed following cc_net's methods, which include replacing every digit with zero and normalizing punctuation; the same normalization must be applied to any text being scored.
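
A rough sketch of what this normalization might look like, assuming simple digit replacement and a small punctuation mapping (the exact rules live in cc_net and may differ):

  import re

  # Map common Unicode punctuation to ASCII equivalents (illustrative
  # subset; cc_net's actual table is larger)
  PUNCT_MAP = str.maketrans({
      "\u2018": "'", "\u2019": "'",   # curly single quotes
      "\u201c": '"', "\u201d": '"',   # curly double quotes
      "\u2013": "-", "\u2014": "-",   # en and em dashes
      "\u2026": "...",                # ellipsis
  })

  def normalize(text: str) -> str:
      # Replace every digit with zero so "born in 1987" and
      # "born in 2012" yield the same n-grams
      text = re.sub(r"\d", "0", text)
      return text.translate(PUNCT_MAP)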

Training

Models are trained on datasets such as Wikipedia and OSCAR. Because the models only ever see preprocessed text, inference must use the same settings: keep the default values for parameters such as lower_case, remove_accents, normalize_numbers, and punctuation.
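
If the helper exposes these settings as keyword arguments (an assumption here; check model.py for the actual signature), keeping them explicit might look like:

  from model import KenlmModel

  # Hypothetical keyword arguments shown with their presumed
  # training-time defaults; verify against model.py before relying on them
  model = KenlmModel.from_pretrained(
      "wikipedia",
      "en",
      lower_case=False,
      remove_accents=False,
      normalize_numbers=True,
      punctuation=1,
  )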

Guide: Running Locally

  1. Dependencies:

    • Install KenLM:
      pip install https://github.com/kpu/kenlm/archive/master.zip
      
    • Install SentencePiece:
      pip install sentencepiece
      
  2. Load and Use the Model:

    • Example Python code to load a model and get perplexity:
      from model import KenlmModel
      
      # Load model trained on English Wikipedia
      model = KenlmModel.from_pretrained("wikipedia", "en")
      
      # Get perplexity
      print(model.get_perplexity("I am very perplexed"))  # Outputs a low perplexity score
      print(model.get_perplexity("im hella trippin"))  # Outputs a high perplexity score
      
  3. Scaling Up: KenLM scoring is CPU-bound, so GPUs offer no speedup here; to process large datasets efficiently, use cloud instances with ample CPU cores and memory from providers such as AWS, Google Cloud, or Azure, and parallelize scoring across processes.
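
For reference, perplexity can also be computed with the kenlm package directly. These models were trained on SentencePiece-tokenized text, so tokenize (and normalize, as sketched above) before scoring; the KenlmModel helper presumably handles this internally. The file names below follow the naming scheme listed under Architecture:

  import kenlm
  import sentencepiece

  # Load the SentencePiece tokenizer and the KenLM binary for English
  sp = sentencepiece.SentencePieceProcessor(model_file="en.sp.model")
  model = kenlm.Model("en.arpa.bin")

  # Tokenize into space-joined subword pieces, as during training
  tokens = " ".join(sp.encode("I am very perplexed", out_type=str))

  # kenlm reports log10 probabilities; perplexity() wraps the math:
  # ppl = 10 ** (-log10_prob / word_count)
  print(model.perplexity(tokens))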

License

The KenLM models and associated files are released under the MIT License.
