bert-wiki-paragraphs

dennlinger

Introduction

BERT-WIKI-PARAGRAPHS is a text classification model for sentence-similarity-style tasks: it uses the BERT architecture to determine whether two paragraphs from Wikipedia belong to the same topic. The model is available on Hugging Face and can be used with the Transformers library.

Architecture

The model is based on the BERT architecture, specifically the bert-base-uncased variant. It takes a pair of paragraphs as input and predicts whether they belong to the same topic. The model runs on PyTorch and is compatible with Safetensors for efficient deployment.
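
For reference, the snippet below loads the model directly with the standard Transformers sequence-classification classes and scores one paragraph pair. It is a minimal sketch; the two example paragraphs are placeholders rather than text from the training data.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Load the tokenizer and the classification head on top of bert-base-uncased.
    tokenizer = AutoTokenizer.from_pretrained("dennlinger/bert-wiki-paragraphs")
    model = AutoModelForSequenceClassification.from_pretrained("dennlinger/bert-wiki-paragraphs")

    # Encode the two paragraphs as one sequence pair; the tokenizer inserts
    # the [SEP] token between them.
    first = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
    second = "It is named after the engineer Gustave Eiffel."
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Class 1 corresponds to "same topic", class 0 to "different topics".
    print(logits.argmax(dim=-1).item())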

Training

BERT-WIKI-PARAGRAPHS was trained in a weakly supervised manner on a dataset of Wikipedia articles, where paragraphs from the same section are labeled as topically coherent. Training ran for three epochs with a batch size of 24, using gradient accumulation for an effective batch size of 48, a learning rate of 1e-4, and gradient clipping at 5. Training was carried out on a single Titan RTX GPU over roughly three weeks.
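
As a rough illustration, those hyperparameters map onto the Transformers TrainingArguments as shown below. This is a sketch for orientation only, not the original training script, which is not included in this card.

    from transformers import TrainingArguments

    # Illustrative mapping of the reported hyperparameters; the dataset of
    # weakly-labeled Wikipedia paragraph pairs is not shown here.
    args = TrainingArguments(
        output_dir="bert-wiki-paragraphs",
        num_train_epochs=3,              # three epochs
        per_device_train_batch_size=24,  # batch size of 24
        gradient_accumulation_steps=2,   # effective batch size of 48
        learning_rate=1e-4,              # learning rate 1e-4
        max_grad_norm=5.0,               # gradient clipping at 5
    )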

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Transformers library:

    pip install transformers
    
  2. Use the model in a Python script:

    from transformers import pipeline
    # Load the paragraph-coherence classifier.
    pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")
    # The two paragraphs are joined with the [SEP] token.
    result = pipe("{First paragraph} [SEP] {Second paragraph}")
    print(result)
    
  3. Interpret the output: a prediction of "1" indicates that the paragraphs are topically connected, while "0" suggests they are not (a small wrapper illustrating this is sketched below).
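
For convenience, the output can be wrapped in a small helper that returns a boolean. The function name below is purely illustrative, and the label check assumes the pipeline reports the positive class with a label ending in "1" (e.g. "1" or "LABEL_1").

    from transformers import pipeline

    pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")

    def same_topic(first: str, second: str) -> bool:
        """Illustrative wrapper: True if the two paragraphs are predicted
        to be topically connected."""
        result = pipe(f"{first} [SEP] {second}")[0]
        # The pipeline returns a dict with "label" and "score"; the positive
        # class is assumed to carry a label ending in "1".
        return result["label"].endswith("1")

    print(same_topic("{First paragraph}", "{Second paragraph}"))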

For enhanced performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
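
If a CUDA GPU is available, the same pipeline can be placed on it and fed a list of inputs; the sketch below assumes device 0 is a usable GPU.

    from transformers import pipeline

    # device=0 selects the first CUDA GPU; use device=-1 to stay on the CPU.
    pipe = pipeline(
        "text-classification",
        model="dennlinger/bert-wiki-paragraphs",
        device=0,
    )

    # Lists of inputs are processed in batches.
    pairs = [
        "{First paragraph} [SEP] {Second paragraph}",
        "{Third paragraph} [SEP] {Fourth paragraph}",
    ]
    print(pipe(pairs, batch_size=8))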

License

The BERT-WIKI-PARAGRAPHS model is licensed under the MIT License, allowing for broad use and modification.
