ml6team

KEYPHRASE EXTRACTION MODEL: KBIR-INSPEC

Introduction

Keyphrase extraction is a text analysis technique that pulls the most important keyphrases out of a document, so readers can grasp its content quickly without reading it in full. The task was originally done by human annotators. It is now largely automated, and deep learning models capture the semantic meaning of a text better than classical machine learning methods do.

Architecture

The model uses KBIR (Keyphrase Boundary Infilling with Replacement) as its base and is fine-tuned on the Inspec dataset. KBIR is pre-trained in a multi-task learning setup that optimizes a combined loss over Masked Language Modeling, Keyphrase Boundary Infilling, and Keyphrase Replacement Classification. For keyphrase extraction, the transformer is fine-tuned as a token classifier that labels each word as either part of a keyphrase or not.
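To make the token-classification view concrete, the following is a minimal sketch of how per-word labels could be grouped back into keyphrases. It assumes a B-KEY / I-KEY / O tagging scheme (begin, inside, outside). The label names and the helper `decode_keyphrases` are illustrative assumptions and are not stated in the description above.

```python
from typing import List


def decode_keyphrases(words: List[str], labels: List[str]) -> List[str]:
    """Group words into keyphrases from per-word labels.

    Assumes a hypothetical B-KEY / I-KEY / O scheme: B-KEY starts a phrase,
    I-KEY continues the current phrase, and O is outside any phrase.
    """
    phrases: List[str] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        if label == "B-KEY":
            # A new phrase starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif label == "I-KEY" and current:
            # Continue the phrase that is currently open.
            current.append(word)
        else:
            # O label, or a stray I-KEY with no open phrase: close any open phrase.
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


words = ["Keyphrase", "extraction", "helps", "summarize", "scientific", "papers"]
labels = ["B-KEY", "I-KEY", "O", "O", "B-KEY", "I-KEY"]
print(decode_keyphrases(words, labels))
# ['Keyphrase extraction', 'scientific papers']
```

In practice the pipeline's aggregation step performs this grouping, so the sketch is only meant to show what "part of a keyphrase or not" means at the word level.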

Training

The model is trained on the Inspec dataset, which comprises 2000 English scientific papers. Training uses a learning rate of 1e-4 for up to 50 epochs, with early stopping. Preprocessing consists of tokenization followed by label realignment, so that word-level labels still line up after words are split into subword tokens. KBIR is reported to improve on earlier state-of-the-art keyphrase extraction methods.
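The label realignment step can be sketched as follows. This is one common policy, not necessarily the one used to train this model: the first subword of each word keeps the word's label, while special tokens and the remaining subwords receive an ignore index. The helper `align_labels`, the `label_all_subwords` flag, and the integer label ids are hypothetical. The `word_ids` list has the shape returned by the `word_ids()` method of a fast Hugging Face tokenizer's output, but is written by hand here so the example runs without downloading a tokenizer.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to the training loss.
IGNORE_INDEX = -100


def align_labels(
    word_labels: List[int],
    word_ids: List[Optional[int]],
    label_all_subwords: bool = False,
) -> List[int]:
    """Expand word-level labels to subword-token level.

    word_ids maps each token to the index of its source word,
    or None for special tokens such as the start and end markers.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never scored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word: carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subword of the same word: copy the label or ignore it.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Three words; the second word is split into two subword tokens.
word_labels = [0, 1, 2]
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_labels, word_ids))
# [-100, 0, 1, -100, 2, -100]
```

Ignoring the trailing subwords keeps each word counted once in the loss, while copying the label to every subword is a reasonable alternative when words are split heavily.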

Guide: Running Locally

  1. Installation: Ensure you have Python and the required libraries installed: transformers and numpy for the code in step 2, plus datasets if you also want to work with the Inspec data.
  2. Load Model: Use the following code to load and use the keyphrase extraction pipeline:
    from transformers import TokenClassificationPipeline, AutoModelForTokenClassification, AutoTokenizer
    from transformers.pipelines import AggregationStrategy
    import numpy as np
    
    # Pipeline wrapper that loads the model and its tokenizer from a model name
    class KeyphraseExtractionPipeline(TokenClassificationPipeline):
        def __init__(self, model, *args, **kwargs):
            super().__init__(
                model=AutoModelForTokenClassification.from_pretrained(model),
                tokenizer=AutoTokenizer.from_pretrained(model),
                *args,
                **kwargs
            )
    
        def postprocess(self, all_outputs):
            # Merge adjacent keyphrase tokens into whole phrases
            results = super().postprocess(
                all_outputs=all_outputs,
                aggregation_strategy=AggregationStrategy.SIMPLE,
            )
            # Strip whitespace and drop duplicates (np.unique also sorts the phrases)
            return np.unique([result.get("word").strip() for result in results])
    
    model_name = "ml6team/keyphrase-extraction-kbir-inspec"
    extractor = KeyphraseExtractionPipeline(model=model_name)
    text = "Your text here"
    keyphrases = extractor(text)
    print(keyphrases)
    
  3. Hardware: Consider using cloud GPUs for faster processing, such as those available from AWS, Google Cloud, or Azure.
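One note on the postprocess step in step 2: np.unique returns its values sorted, so the extracted keyphrases come back in alphabetical order rather than in the order they appear in the text. If document order matters, a variant along the following lines could be used instead. This is a sketch only: the helper `unique_keyphrases_in_order` is hypothetical, and the input is a hand-written stand-in for the list of dictionaries with a "word" key that the aggregated pipeline output provides.

```python
from typing import Dict, Iterable, List


def unique_keyphrases_in_order(results: Iterable[Dict[str, object]]) -> List[str]:
    """Deduplicate keyphrases while keeping their first-seen order."""
    phrases = (str(result.get("word", "")).strip() for result in results)
    # dict preserves insertion order, so fromkeys drops repeats without sorting.
    return list(dict.fromkeys(phrase for phrase in phrases if phrase))


# Stand-in for aggregated pipeline results (only the "word" field is used).
mock_results = [
    {"word": " keyphrase extraction"},
    {"word": "deep learning"},
    {"word": "keyphrase extraction "},
]
print(unique_keyphrases_in_order(mock_results))
# ['keyphrase extraction', 'deep learning']
```

The function returns a plain list rather than a NumPy array, which may be easier to serialize or pass on to other code.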

License

The model is released under the MIT License, allowing for commercial use, modification, distribution, and private use.
