Introduction

The British Library Books Genre Detector is a fine-tuned distilbert-base-cased model that classifies books as fiction or non-fiction. It targets the British Library's Digitised Printed Books collection (18th-19th century) and uses title metadata to split this large dataset into two broad genre categories.

Architecture

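The model uses the distilbert-base-cased architecture, a distilled version of BERT that is smaller and faster than the original and well suited to text classification. The training data is primarily English but includes titles in several other languages, so the model accepts non-English titles. Because the base model was pretrained on English text, predictions for other languages may be less reliable.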

Training

The training data was sourced from the British Library's collections and expanded using Snorkel, a framework for programmatic (weak-supervision) labeling. Training was implemented with the blurr library, and the detailed procedure is available in the project's GitHub repository. The model reached 94% accuracy on a held-out dataset, but this figure should be read with caution because of potential biases in the data.
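The idea behind programmatic labeling can be sketched in plain Python. This is a minimal illustration, not the project's labeling functions and not Snorkel's API: the keyword lists, the "Non-fiction" label name, and the helper names are all hypothetical, and a simple majority vote stands in for the label-combination step.

```python
import re
from collections import Counter

FICTION = "Fiction"
NON_FICTION = "Non-fiction"  # assumed label name, for illustration only
ABSTAIN = None               # a labeling function may decline to vote


def _words(title):
    """Lowercase the title and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", title.lower()))


def lf_fiction_keywords(title):
    # Hypothetical heuristic: words that often appear in fiction titles.
    return FICTION if _words(title) & {"novel", "tale", "romance"} else ABSTAIN


def lf_non_fiction_keywords(title):
    # Hypothetical heuristic: words that often appear in non-fiction titles.
    return NON_FICTION if _words(title) & {"history", "report", "treatise"} else ABSTAIN


LABELING_FUNCTIONS = [lf_fiction_keywords, lf_non_fiction_keywords]


def weak_label(title, labeling_functions):
    """Combine noisy votes by majority; abstain on no votes or a tie."""
    votes = [lf(title) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, no clear winner
    return ranked[0][0]
```

Titles that receive a label this way can extend a smaller hand-annotated set, while titles on which the heuristics abstain or disagree are left unlabeled.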

Guide: Running Locally

To run the model locally, use the Hugging Face Transformers library as follows:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Download (or load from the local cache) the tokenizer and fine-tuned weights.
tokenizer = AutoTokenizer.from_pretrained("davanstrien/bl-books-genre")
model = AutoModelForSequenceClassification.from_pretrained("davanstrien/bl-books-genre")

# Wrap both in a text-classification pipeline that accepts raw title strings.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a single book title.
result = classifier("Oliver Twist")
print(result)  # Output: [{'label': 'Fiction', 'score': 0.9980145692825317}]

For faster inference on large batches of titles, consider running on a GPU, for example an AWS EC2 GPU instance or Google Cloud's AI Platform.
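When classifying many titles, it usually helps to send them to the classifier in batches rather than one at a time. The sketch below assumes only that the classifier is a callable that takes a list of strings and returns one dict with "label" and "score" keys per input, which is how a Transformers text-classification pipeline behaves. The helper name classify_titles and the default batch size are illustrative choices.

```python
def classify_titles(classifier, titles, batch_size=32):
    """Classify titles in batches.

    Returns a list of (title, label, score) tuples in input order.
    `classifier` is any callable mapping a list of strings to a list of
    {"label": ..., "score": ...} dicts, one per input string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    titles = list(titles)
    results = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        predictions = classifier(batch)
        for title, prediction in zip(batch, predictions):
            results.append((title, prediction["label"], prediction["score"]))
    return results
```

With the pipeline from the guide above, a call would look like classify_titles(classifier, ["Oliver Twist", "A History of England"]). To run on a GPU, the pipeline can be created with device=0 (the first CUDA device), assuming a CUDA-enabled PyTorch installation.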

License

The model is licensed under the MIT License, which permits use, modification, and redistribution provided the copyright and license notice are retained.
