keyphrase extraction kbir inspec
ml6team
KEYPHRASE EXTRACTION MODEL: KBIR-INSPEC
Introduction
Keyphrase extraction is a text analysis technique that pulls the most important phrases out of a document, so readers can grasp its content without reading it in full. The task was originally done by human annotators. Today it is largely handled by deep learning models, which capture the semantic meaning of text better than classical machine learning methods.
Architecture
The model uses KBIR (Keyphrase Boundary Infilling with Replacement) as its base and is fine-tuned on the Inspec dataset. KBIR is pre-trained in a multi-task learning setup that optimizes a combined loss from Masked Language Modeling, Keyphrase Boundary Infilling, and Keyphrase Replacement Classification. For keyphrase extraction, this transformer is fine-tuned as a token classifier: each token is labeled as either part of a keyphrase or not.
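To make the token classification step concrete, here is a minimal sketch of how per-word tags could be merged back into keyphrases. It assumes a B/I/O tagging scheme (B = beginning of a keyphrase, I = inside, O = outside), which the text above does not spell out; the function name `group_keyphrases` and the sample tags are hypothetical.

```python
from typing import List


def group_keyphrases(words: List[str], tags: List[str]) -> List[str]:
    """Merge consecutive B/I-tagged words into keyphrases (assumed B/I/O scheme)."""
    phrases: List[str] = []
    current: List[str] = []
    for word, tag in zip(words, tags):
        if tag == "B":
            # A new keyphrase starts; close any phrase that is still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            # Continue the keyphrase that is currently open.
            current.append(word)
        else:
            # "O", or a stray "I" with no open phrase: close and reset.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written tags for illustration only, not real model output.
words = ["Keyphrase", "extraction", "is", "a", "text", "analysis", "technique"]
tags = ["B", "I", "O", "O", "B", "I", "I"]
print(group_keyphrases(words, tags))
# ['Keyphrase extraction', 'text analysis technique']
```

In practice the pipeline's aggregation step does this grouping for you; the sketch only shows the idea behind it.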
Training
The model is fine-tuned on the Inspec dataset, which comprises 2000 English scientific papers. Training uses a learning rate of 1e-4 for up to 50 epochs, with early stopping. Preprocessing consists of tokenization followed by label realignment, so that each subword token produced by the tokenizer carries the label of the word it came from. The model is reported to outperform earlier state-of-the-art keyphrase extraction methods.
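The label realignment step can be sketched as follows. This is one common scheme, assumed here rather than confirmed by the text above: special tokens receive "O", the first subword of a word keeps the word's label, and later subwords of a "B" word become "I" so that a keyphrase has a single beginning. The `word_ids` list mirrors what a fast Hugging Face tokenizer returns from `word_ids()` (`None` for special tokens); the function name and the sample split are hypothetical.

```python
from typing import List, Optional


def realign_labels(word_labels: List[str], word_ids: List[Optional[int]]) -> List[str]:
    """Expand word-level B/I/O labels to subword tokens (assumed scheme)."""
    aligned: List[str] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (e.g. start/end markers) belong to no word.
            aligned.append("O")
        elif word_id != previous:
            # First subword of a word keeps the word's own label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords: "B" becomes "I"; other labels are copied.
            label = word_labels[word_id]
            aligned.append("I" if label == "B" else label)
        previous = word_id
    return aligned


# Hypothetical split: "keyphrase extraction works", with "keyphrase"
# broken into two subwords and special tokens at both ends.
word_labels = ["B", "I", "O"]
word_ids = [None, 0, 0, 1, 2, None]
print(realign_labels(word_labels, word_ids))
# ['O', 'B', 'I', 'I', 'O', 'O']
```

Another widely used variant assigns an ignore index (such as -100) to special tokens and continuation subwords so the loss skips them; which variant fits depends on the training code.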
Guide: Running Locally
- Installation: Ensure you have Python and the necessary libraries installed, such as transformers and datasets.
- Load Model: Use the following code to load and use the keyphrase extraction pipeline:
    from transformers import (
        TokenClassificationPipeline,
        AutoModelForTokenClassification,
        AutoTokenizer,
    )
    from transformers.pipelines import AggregationStrategy
    import numpy as np


    # Token classification pipeline that returns a deduplicated list of keyphrases
    class KeyphraseExtractionPipeline(TokenClassificationPipeline):
        def __init__(self, model, *args, **kwargs):
            # Load the model and its tokenizer from the same checkpoint name
            super().__init__(
                model=AutoModelForTokenClassification.from_pretrained(model),
                tokenizer=AutoTokenizer.from_pretrained(model),
                *args,
                **kwargs
            )

        def postprocess(self, all_outputs):
            # Merge adjacent keyphrase tokens into whole phrases
            results = super().postprocess(
                all_outputs=all_outputs,
                aggregation_strategy=AggregationStrategy.SIMPLE,
            )
            # Strip whitespace and drop duplicates (np.unique also sorts)
            return np.unique([result.get("word").strip() for result in results])


    model_name = "ml6team/keyphrase-extraction-kbir-inspec"
    extractor = KeyphraseExtractionPipeline(model=model_name)

    text = "Your text here"
    keyphrases = extractor(text)
    print(keyphrases)
- Hardware: Consider using cloud GPUs for faster processing, such as those available from AWS, Google Cloud, or Azure.
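One detail of the loading code is worth noting: its `postprocess` method deduplicates with `np.unique`, which also sorts the phrases alphabetically, so the order in which they appear in the text is lost. If document order matters, a small alternative is sketched below. The helper name `unique_in_order` is hypothetical, and the sample dictionaries only imitate the aggregated pipeline output (the `"KEY"` group name is an assumption).

```python
from typing import Dict, List


def unique_in_order(results: List[Dict]) -> List[str]:
    """Deduplicate extracted phrases while keeping first-seen order."""
    words = (result["word"].strip() for result in results)
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(words))


# Imitation of aggregated token classification results, for illustration only.
results = [
    {"entity_group": "KEY", "word": " text analysis"},
    {"entity_group": "KEY", "word": " keyphrases"},
    {"entity_group": "KEY", "word": " text analysis"},
]
print(unique_in_order(results))
# ['text analysis', 'keyphrases']
```

To use it, the last line of `postprocess` would return `unique_in_order(results)` instead of the `np.unique(...)` call.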
License
The model is released under the MIT License, allowing for commercial use, modification, distribution, and private use.