fineweb edu classifier
HuggingFaceFW
Introduction
The FineWeb-Edu Classifier scores the educational value of web pages. It was developed to filter and curate educational content from web datasets, and it was trained on annotations generated by the Llama3-70B-Instruct model.
Architecture
The classifier is the Snowflake-arctic-embed embedding model with an added classification head that produces a regression output. It is trained to predict the educational-quality scores that the Llama3-70B-Instruct model assigned to web samples.
Training
The model was trained on a dataset of 450,000 annotated web samples. Each annotation scores educational value from 0 (not educational) to 5 (highly educational). Training ran for 20 epochs with a learning rate of 3e-4. The embedding and encoder layers were frozen, so only the classification head was updated. When its output is treated as a binary classifier with a score threshold of 3, the model achieves an F1 score of 82%.
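The binary evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the function name `f1_at_threshold` and the toy scores are hypothetical, and it assumes a sample counts as positive when its score is at least the threshold (the source does not say whether scores are rounded first).

```python
def f1_at_threshold(predicted, annotated, threshold=3.0):
    """F1 score after binarizing both score lists at `threshold`.

    Assumption: score >= threshold means "educational" (positive class).
    """
    pred_pos = [p >= threshold for p in predicted]
    true_pos = [a >= threshold for a in annotated]

    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(not p and t for p, t in zip(pred_pos, true_pos))

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only (not taken from the real dataset).
predicted = [0.4, 2.9, 3.2, 4.1, 3.6, 1.0]  # classifier regression outputs
annotated = [0, 3, 3, 4, 2, 1]              # annotator scores on the 0-5 scale

f1 = f1_at_threshold(predicted, annotated)
print(f"F1 at threshold 3: {f1:.3f}")
```

On the toy data there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.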
Training Details:
- Model: Snowflake-arctic-embed with classification head
- Dataset: 450,000 web samples annotated by Llama3-70B-Instruct
- Epochs: 20
- Learning Rate: 3e-4
- Evaluation Metric: F1 score
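The setup listed above, a frozen encoder with a trainable regression head, might look roughly like the following PyTorch sketch. Several parts are assumptions rather than details from the source: `TinyEncoder` is a hypothetical stand-in for the pretrained Snowflake-arctic-embed model, and the AdamW optimizer, MSE loss, and random toy batch are illustrative choices only. The learning rate (3e-4) and the 20 passes over the data are taken from the list.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained embedding model."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.Linear(hidden, hidden)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token vectors into one vector per document.
        return torch.tanh(self.layer(self.embed(input_ids))).mean(dim=1)


class EduRegressor(nn.Module):
    """Encoder plus a classification head with one regression output."""

    def __init__(self, encoder: nn.Module, hidden: int = 32):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(input_ids)).squeeze(-1)


torch.manual_seed(0)
model = EduRegressor(TinyEncoder())

# Freeze the embedding and encoder layers; only the head stays trainable.
model.encoder.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4)  # optimizer is an assumption
loss_fn = nn.MSELoss()                             # loss is an assumption

# Toy batch: 8 documents of 16 token ids, with scores on the 0-5 scale.
input_ids = torch.randint(0, 1000, (8, 16))
targets = torch.randint(0, 6, (8,)).float()

encoder_before = model.encoder.layer.weight.detach().clone()
head_before = model.head.weight.detach().clone()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(input_ids), targets)
    loss.backward()
    optimizer.step()
```

Freezing the encoder keeps the pretrained representation intact and limits training to the small number of parameters in the head.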
Guide: Running Locally
To use the FineWeb-Edu classifier with the Transformers library, follow these steps:
- Install Dependencies: Ensure you have the transformers library installed. The example below also needs PyTorch, because it requests "pt" tensors.
- Load the Model:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
    model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceTB/fineweb-edu-classifier")
- Prepare the Input:
    text = "This is a test sentence."
    inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
- Run Inference:
    outputs = model(**inputs)
    # The head has a single regression output, so drop the last dimension.
    logits = outputs.logits.squeeze(-1).float().detach().numpy()
    score = logits.item()
    result = {
        "text": text,
        "score": score,
        # Clamp the raw score to the 0-5 range, then round to an integer.
        "int_score": int(round(max(0, min(score, 5)))),
    }
    print(result)
    # {'text': 'This is a test sentence.', 'score': 0.07964489609003067, 'int_score': 0}
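The clamp-and-round step from the inference code can be pulled out into a helper and used to filter documents. The sketch below is illustrative: `to_int_score` and `keep_educational` are hypothetical names, the sample documents are made up, and using 3 as the keep threshold is an assumption borrowed from the binary evaluation threshold mentioned under Training.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round to an integer.

    Note: Python's round() rounds halves to the nearest even integer,
    so 2.5 becomes 2 while 3.5 becomes 4.
    """
    return int(round(max(0, min(score, 5))))


def keep_educational(docs, threshold=3):
    """Return the docs whose integer score is at least `threshold`."""
    return [d for d in docs if to_int_score(d["score"]) >= threshold]


# Made-up scores for illustration; real scores come from the classifier.
docs = [
    {"text": "doc a", "score": 0.0796},
    {"text": "doc b", "score": 2.6},
    {"text": "doc c", "score": 3.4},
    {"text": "doc d", "score": 5.7},   # above the scale, clamped to 5
    {"text": "doc e", "score": -0.3},  # below the scale, clamped to 0
]

kept = keep_educational(docs)
print([d["text"] for d in kept])
```

Because the model is a regressor, raw scores can fall slightly outside the 0 to 5 range, which is why clamping happens before rounding.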
For faster inference on large volumes of text, consider running the model on cloud GPU instances, such as those available through AWS EC2 or Google Cloud's AI Platform.
License
The FineWeb-Edu Classifier is released under the Apache 2.0 License.