keyphrase-extraction-distilbert-inspec

ml6team

Introduction

Keyphrase extraction is a text analysis technique that identifies the most important keyphrases in a document, so that readers can grasp its content without reading it in full. The task was originally performed by human annotators; machine learning and deep learning models now automate it and are able to capture the semantic meaning and context of the text.

Architecture

The model uses DistilBERT as its base and is fine-tuned on the Inspec dataset for keyphrase extraction. It treats the task as token classification: each word in the document is labeled as either part of a keyphrase or not. The model is aimed at abstracts of scientific papers and is designed for English-language documents.
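To make the token-classification framing concrete, the sketch below groups per-word labels into keyphrases. It assumes a BIO-style label scheme (B-KEY for the first word of a keyphrase, I-KEY for a continuation, O for everything else). The label names, the function name `group_keyphrases`, and the sample sentence are illustrative assumptions rather than details stated above.

```python
from typing import List


def group_keyphrases(words: List[str], labels: List[str]) -> List[str]:
    """Merge consecutive B-KEY/I-KEY words into keyphrase strings."""
    phrases: List[str] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        if label == "B-KEY":
            # A new keyphrase starts; flush any phrase in progress.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif label == "I-KEY" and current:
            # Continue the keyphrase that is currently open.
            current.append(word)
        else:
            # "O" (or a stray I-KEY with no open phrase) closes the phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical word-level predictions for a short sentence.
words = ["Keyphrase", "extraction", "uses", "deep", "learning", "models"]
labels = ["B-KEY", "I-KEY", "O", "B-KEY", "I-KEY", "O"]
print(group_keyphrases(words, labels))
```

A stray I-KEY with no preceding B-KEY is dropped here; a more lenient decoder could instead treat it as the start of a new phrase.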

Training

The model was trained on the Inspec dataset, which contains 2,000 scientific papers annotated with keyphrases. Training involved preprocessing the documents, tokenizing them, and aligning the word-level labels with subword tokens. The model was trained for up to 50 epochs, with early stopping after 3 epochs without improvement. It is evaluated with precision, recall, and F1-score, and shows competitive performance on keyphrase extraction tasks.
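The label-alignment step and the evaluation metrics can be sketched in plain Python. The example assumes that each subword token comes with the index of the word it belongs to (None for special tokens), which Hugging Face fast tokenizers expose through `word_ids()`. Labeling only the first subword and masking the rest with -100, the exact-match scoring, and all names and sample values are assumptions for illustration, not the authors' confirmed procedure.

```python
from typing import List, Optional, Sequence, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids: Sequence[Optional[int]], word_labels: Sequence[int]) -> List[int]:
    """Expand word-level labels to subword tokens, labeling only the first subword."""
    aligned: List[int] = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # later subword of the same word
        previous = word_id
    return aligned


def precision_recall_f1(predicted: Sequence[str], gold: Sequence[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of lowercased keyphrases."""
    pred_set = {p.lower() for p in predicted}
    gold_set = {g.lower() for g in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical tokenization: [CLS] key ##phrase extraction [SEP]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 2]  # illustrative label ids, one per word
print(align_labels(word_ids, word_labels))

# Made-up predictions and gold keyphrases, used only to exercise the metric.
print(precision_recall_f1(["deep learning", "keyphrase"], ["Deep learning", "semantic meaning"]))
```

Masking the later subwords keeps each word from counting more than once in the loss; copying the word label to every subword is a common alternative.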

Guide: Running Locally

  1. Setup Environment

    • Install the transformers library together with PyTorch, which the model classes used below require:
      pip install transformers torch
      
  2. Load Model and Pipeline

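    • Use the following Python code to set up the keyphrase extraction pipeline:
      from transformers import TokenClassificationPipeline, AutoModelForTokenClassification, AutoTokenizer

      # Pipeline subclass that loads the model and its tokenizer from a model name.
      class KeyphraseExtractionPipeline(TokenClassificationPipeline):
          def __init__(self, model, *args, **kwargs):
              super().__init__(model=AutoModelForTokenClassification.from_pretrained(model),
                               tokenizer=AutoTokenizer.from_pretrained(model), *args, **kwargs)
      
      model_name = "ml6team/keyphrase-extraction-distilbert-inspec"
      # aggregation_strategy="first" groups subword tokens into words and merges adjacent
      # keyphrase words into one span; without it the pipeline returns one prediction per token.
      extractor = KeyphraseExtractionPipeline(model=model_name, aggregation_strategy="first")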
      
  3. Inference

    • Pass text data to the pipeline to extract keyphrases:
      text = "Your text here."
      keyphrases = extractor(text)
      print(keyphrases)
      
  4. Cloud GPUs

    • For faster inference on large volumes of text, consider running the model on a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
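The inference and GPU steps above can be tied together with two small helpers. This is a sketch under assumptions: `pick_device` and `unique_keyphrases` are hypothetical names, the sample results are made up to show the shape of aggregated token-classification output, and the pipeline call itself appears only in a comment because it needs the model to be downloaded.

```python
from typing import Dict, List


def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def unique_keyphrases(results: List[Dict]) -> List[str]:
    """Collect distinct keyphrase strings from aggregated pipeline output, in order."""
    seen = set()
    phrases: List[str] = []
    for entry in results:
        phrase = entry["word"].strip()
        # Compare case-insensitively so "Deep learning" and "deep learning" count once.
        if phrase and phrase.lower() not in seen:
            seen.add(phrase.lower())
            phrases.append(phrase)
    return phrases


# With the extractor class from step 2, the calls would look like this (not run here):
#   extractor = KeyphraseExtractionPipeline(model=model_name, device=pick_device(),
#                                           aggregation_strategy="first")
#   results = extractor(text)

# Hypothetical aggregated output, made up to show the expected shape.
sample_results = [
    {"entity_group": "KEY", "score": 0.99, "word": "keyphrase extraction", "start": 0, "end": 20},
    {"entity_group": "KEY", "score": 0.97, "word": "deep learning", "start": 40, "end": 53},
    {"entity_group": "KEY", "score": 0.95, "word": "Keyphrase extraction", "start": 80, "end": 100},
]
print(pick_device())
print(unique_keyphrases(sample_results))
```

The `device` argument follows the transformers pipeline convention of -1 for CPU and a non-negative integer for a GPU index, so the same script runs unchanged on a laptop and on a cloud GPU instance.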

License

The model is licensed under the MIT License, allowing for broad usage and modification.
