llm-data-textbook-quality-fasttext-classifier-v2
kenhktsui

Introduction
The LLM-DATA-TEXTBOOK-QUALITY-FASTTEXT-CLASSIFIER-V2 model is a fastText-based tool designed to classify the educational value of web text. It categorizes text into three levels: High, Mid, and Low educational value, offering a granular approach to data curation for large language model (LLM) training.
Architecture
The model is built on fastText, enabling it to classify more than 2,000 examples per second on CPU. This throughput makes it practical for on-the-fly data filtering during pretraining. By distinguishing Mid-level content from High and Low, the classifier provides finer-grained text evaluation than a binary keep/discard filter.
Training
The model is trained using raw web text data, focusing on educational content. It draws inspiration from the "Textbooks Are All You Need" approach, aiming to identify and filter high educational value data for LLM pretraining. The classifier's effectiveness is validated through various benchmarks and comparisons with synthetic and real datasets.
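To make the filtering workflow concrete, a corpus can be thresholded on its predicted scores. This is a minimal sketch under assumptions not stated in the model card: it presumes a 0–2 educational-value score scale and an illustrative cutoff of 1.0, with the scores themselves produced by the classifier beforehand.

```python
def filter_by_educational_value(texts, scores, threshold=1.0):
    """Keep only texts whose predicted educational-value score meets the
    threshold. The 0-2 scale and the 1.0 cutoff are assumptions chosen
    for illustration, not values prescribed by the model card."""
    return [text for text, score in zip(texts, scores) if score >= threshold]

docs = ["A proof of the triangle inequality starts from ...", "BUY NOW!!! limited offer"]
scores = [1.8, 0.2]  # illustrative scores, as the classifier might assign them
kept = filter_by_educational_value(docs, scores)  # retains only the first text
```

In practice the threshold is a tunable trade-off: raising it yields a smaller but cleaner pretraining corpus.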
Guide: Running Locally
To run the model locally, follow these steps:
1. Install dependencies: ensure that you have Python and fastText installed. fastText can be installed via pip:

   ```
   pip install fasttext
   ```

2. Download the model: use the `hf_hub_download` function from the `huggingface_hub` library to download the model file:

   ```python
   from huggingface_hub import hf_hub_download

   model_path = hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin")
   ```

3. Load the model using fastText:

   ```python
   import fasttext

   model = fasttext.load_model(model_path)
   ```

4. Predict educational value: define a function to score text inputs:

   ```python
   def predict_educational_value(text_list):
       # (Function implementation as provided in the README)
       pass
   ```

5. Run predictions: use the function to evaluate text samples.
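The prediction function left unimplemented above can be sketched as follows. This is a minimal illustration, not the README's exact implementation: it assumes the model emits the labels `__label__High`, `__label__Mid`, and `__label__Low`, collapses them into a single 0–2 score with hypothetical weights (High = 2, Mid = 1, Low = 0), and relies on the `model` object loaded in the previous step.

```python
from typing import List, Sequence

# Hypothetical label weights (assumption: the model emits __label__High,
# __label__Mid, and __label__Low, weighted 2 / 1 / 0 here).
LABEL_WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}

def score_from_prediction(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Collapse fastText label probabilities into a single 0-2 score."""
    return sum(LABEL_WEIGHTS.get(label, 0.0) * p for label, p in zip(labels, probs))

def predict_educational_value(text_list: List[str]) -> List[float]:
    """Score each text; fastText cannot handle newlines, so replace them first."""
    cleaned = [text.replace("\n", " ") for text in text_list]
    labels, probs = model.predict(cleaned, k=-1)  # k=-1 returns all labels
    return [score_from_prediction(l, p) for l, p in zip(labels, probs)]
```

Under these assumed weights, a score near 2 indicates predominantly High educational value and a score near 0 predominantly Low.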
For more efficient processing, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
License
The model is released under the MIT License, allowing for wide use and modification within the terms of the license.