KenLM Model Documentation

Introduction

KenLM models are probabilistic n-gram language models suited to fast perplexity estimation, which makes them useful for filtering or sampling large datasets. For example, a model trained on a reference corpus such as Wikipedia assigns high perplexity to text that looks unlike that corpus, so those samples can be filtered out.
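
As an illustration, here is a minimal filtering sketch using the KenlmModel helper described under "Guide: Running Locally" below; the threshold is an arbitrary assumption, not a recommended value:

  from model import KenlmModel

  # Load a model trained on English Wikipedia
  model = KenlmModel.from_pretrained("wikipedia", "en")

  # Hypothetical cutoff: tune it on a held-out sample of your data
  PERPLEXITY_THRESHOLD = 1000.0

  texts = ["The cat sat on the mat.", "asdf qwerty zxcv"]
  # Keep only texts that resemble the reference corpus
  kept = [t for t in texts if model.get_perplexity(t) < PERPLEXITY_THRESHOLD]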

Architecture

The repository contains several KenLM models trained on tokenized datasets across various languages. Each language-specific model comprises three files:

  • {language}.arpa.bin: The trained KenLM model binary.
  • {language}.sp.model: The SentencePiece model for tokenization.
  • {language}.sp.vocab: The vocabulary file for the SentencePiece model.

These models are trained on text preprocessed following cc_net's methods, which include replacing every digit with zero and normalizing punctuation; the same normalization must be applied to any text being scored.
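
A rough sketch of what this normalization might look like, assuming simple digit replacement and a small punctuation mapping (the exact rules live in cc_net and may differ):

  import re

  # Map common Unicode punctuation to ASCII equivalents (illustrative
  # subset; cc_net's actual table is larger)
  PUNCT_MAP = str.maketrans({
      "\u2018": "'", "\u2019": "'",   # curly single quotes
      "\u201c": '"', "\u201d": '"',   # curly double quotes
      "\u2013": "-", "\u2014": "-",   # en and em dashes
      "\u2026": "...",                # ellipsis
  })

  def normalize(text: str) -> str:
      # Replace every digit with zero so "born in 1987" and
      # "born in 2012" yield the same n-grams
      text = re.sub(r"\d", "0", text)
      return text.translate(PUNCT_MAP)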

Training

Models are trained on datasets such as Wikipedia and OSCAR. Because the models only ever see preprocessed text, inference must use the same settings: keep the default values for parameters such as lower_case, remove_accents, normalize_numbers, and punctuation.
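
If the helper exposes these settings as keyword arguments (an assumption here; check model.py for the actual signature), keeping them explicit might look like:

  from model import KenlmModel

  # Hypothetical keyword arguments shown with their presumed
  # training-time defaults; verify against model.py before relying on them
  model = KenlmModel.from_pretrained(
      "wikipedia",
      "en",
      lower_case=False,
      remove_accents=False,
      normalize_numbers=True,
      punctuation=1,
  )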

Guide: Running Locally

  1. Dependencies:

    • Install KenLM:
      pip install https://github.com/kpu/kenlm/archive/master.zip
      
    • Install SentencePiece:
      pip install sentencepiece
      
  2. Load and Use the Model:

    • Example Python code to load a model and get perplexity:
      from model import KenlmModel
      
      # Load model trained on English Wikipedia
      model = KenlmModel.from_pretrained("wikipedia", "en")
      
      # Get perplexity
      print(model.get_perplexity("I am very perplexed"))  # Outputs a low perplexity score
      print(model.get_perplexity("im hella trippin"))  # Outputs a high perplexity score
      
  3. Scaling Up: KenLM scoring is CPU-bound, so GPUs offer no speedup here; to process large datasets efficiently, use cloud instances with ample CPU cores and memory from providers such as AWS, Google Cloud, or Azure, and parallelize scoring across processes.
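
For reference, perplexity can also be computed with the kenlm package directly. These models were trained on SentencePiece-tokenized text, so tokenize (and normalize, as sketched above) before scoring; the KenlmModel helper presumably handles this internally. The file names below follow the naming scheme listed under Architecture:

  import kenlm
  import sentencepiece

  # Load the SentencePiece tokenizer and the KenLM binary for English
  sp = sentencepiece.SentencePieceProcessor(model_file="en.sp.model")
  model = kenlm.Model("en.arpa.bin")

  # Tokenize into space-joined subword pieces, as during training
  tokens = " ".join(sp.encode("I am very perplexed", out_type=str))

  # kenlm reports log10 probabilities; perplexity() wraps the math:
  # ppl = 10 ** (-log10_prob / word_count)
  print(model.perplexity(tokens))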

License

The KenLM models and associated files are released under the MIT License.
