llm data textbook quality fasttext classifier v2

kenhktsui

Introduction

The LLM-DATA-TEXTBOOK-QUALITY-FASTTEXT-CLASSIFIER-V2 model is a fastText-based classifier that rates the educational value of web text. It assigns each text to one of three levels (High, Mid, or Low educational value), which supports finer-grained data curation for large language model (LLM) training.

Architecture

The model is built on fastText and can classify more than 2,000 examples per second on a CPU. That throughput makes it practical for on-the-fly data filtering during pretraining.
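As a rough illustration of on-the-fly filtering, the sketch below streams documents through a scoring function in batches and keeps only those at or above a cutoff. The names `filter_by_score`, `score_fn`, `threshold`, and `batch_size` are hypothetical and do not come from the model card; `score_fn` stands in for any callable that maps a list of texts to a list of numeric scores.

```python
from itertools import islice


def filter_by_score(docs, score_fn, threshold=1.0, batch_size=256):
    """Yield the documents whose score is at least `threshold`.

    `score_fn` is a hypothetical callable: list[str] -> list[float].
    Documents are scored in batches so the classifier is called once per
    batch rather than once per document.
    """
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        for doc, score in zip(batch, score_fn(batch)):
            if score >= threshold:
                yield doc
```

Because the function is a generator, it can wrap a dataset iterator without loading the whole corpus into memory. The right threshold depends on the scoring scheme and should be chosen by inspecting scored samples.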

Training

The model is trained on raw web text, with a focus on educational content. It draws inspiration from the "Textbooks Are All You Need" approach and aims to identify data of high educational value so that it can be selected for LLM pretraining. The classifier's effectiveness is validated through benchmarks and comparisons against synthetic and real datasets.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies: Make sure Python is installed, then install fastText and huggingface_hub (used in step 2 to download the model) via pip:

    pip install fasttext huggingface_hub
    
  2. Download the Model: Use the hf_hub_download function from the huggingface_hub library to download the model file:

    from huggingface_hub import hf_hub_download
    model_path = hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin")
    
  3. Load the Model: Load the model using fastText:

    import fasttext
    model = fasttext.load_model(model_path)
    
  4. Predict Educational Value: Define a function to predict the educational value of text inputs:

    def predict_educational_value(text_list):
        # (Function implementation as provided in the README)
        pass
    
  5. Run Predictions: Use the function to evaluate text samples.
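The placeholder in step 4 defers to the README for the function body. The following is a minimal sketch of one way such a function could work, not the author's implementation. It assumes the labels are named `__label__High`, `__label__Mid`, and `__label__Low` and weights them 2, 1, and 0 to produce an expected score; both the label names and the weights are assumptions that should be checked against `model.get_labels()` and the README. Unlike the stub in step 4, it takes the model as an explicit argument.

```python
import re

# Assumed label names and weights (not confirmed by the model card).
LABEL_WEIGHTS = {
    "__label__Low": 0.0,
    "__label__Mid": 1.0,
    "__label__High": 2.0,
}


def predict_educational_value(model, text_list):
    """Return one probability-weighted score per text (0 = Low, 2 = High)."""
    # fastText's predict() processes one line at a time and rejects
    # newlines, so collapse them into spaces first.
    cleaned = [re.sub(r"\n+", " ", text) for text in text_list]
    # k=-1 requests the probability of every label, not only the top one.
    labels_per_text, probs_per_text = model.predict(cleaned, k=-1)
    scores = []
    for labels, probs in zip(labels_per_text, probs_per_text):
        # Unknown labels contribute nothing to the score.
        score = sum(
            LABEL_WEIGHTS.get(label, 0.0) * float(prob)
            for label, prob in zip(labels, probs)
        )
        scores.append(score)
    return scores
```

With the model loaded in step 3, `predict_educational_value(model, ["some text", "another text"])` returns a list with one float per input. Under the assumed weights, values near 2 indicate high educational value and values near 0 indicate low.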

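fastText inference runs on the CPU, so a GPU is not required and does not speed it up. To process larger corpora faster, pass texts to the model in batches and spread the work across multiple CPU cores or machines, for example on instances from AWS, Google Cloud, or Azure.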

License

The model is released under the MIT License, which permits use, modification, and redistribution subject to the license terms.
