BERT-WIKI-PARAGRAPHS
Introduction
BERT-WIKI-PARAGRAPHS is a text classification model for sentence-similarity-style tasks: given two Wikipedia paragraphs, it predicts whether they belong to the same topic. It builds on the BERT architecture, is available on Hugging Face, and can be used with the Transformers library.
Architecture
The model is based on the BERT architecture, specifically the bert-base-uncased variant. It takes a pair of paragraphs as input and predicts whether the two belong to the same topic. The model runs on PyTorch and its weights are also available in the Safetensors format for efficient deployment.
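The pair-wise input format can also be used without the pipeline API. The following is a minimal sketch of scoring a paragraph pair with the raw model and tokenizer; the example paragraphs are made up for illustration, and the 0/1 label meaning follows the interpretation described in the guide below.

    # Minimal sketch: score a paragraph pair with the raw model.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "dennlinger/bert-wiki-paragraphs"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    first = "The Eiffel Tower is a wrought-iron lattice tower in Paris."
    second = "It was constructed from 1887 to 1889 for the World's Fair."

    # Encoding the two paragraphs as a pair inserts the [SEP] token automatically.
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = logits.argmax(dim=-1).item()  # 1 = same topic, 0 = different topics
    print(prediction)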
Training
BERT-WIKI-PARAGRAPHS was trained in a weakly supervised manner on a dataset built from Wikipedia articles, where paragraphs from the same section are labeled as topically coherent. Training ran for three epochs with a batch size of 24, using gradient accumulation for an effective batch size of 48. The learning rate was set to 1e-4, with gradient clipping at 5. Training took about three weeks on a single Titan RTX GPU.
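The reported hyperparameters map naturally onto Hugging Face TrainingArguments. The snippet below is only a sketch of that configuration, not the original training code; the output directory name is a placeholder, and dataset preparation is assumed to happen elsewhere.

    # Sketch of the reported hyperparameters as TrainingArguments.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="bert-wiki-paragraphs",   # placeholder output directory
        num_train_epochs=3,                  # three epochs
        per_device_train_batch_size=24,      # batch size of 24
        gradient_accumulation_steps=2,       # effective batch size of 48
        learning_rate=1e-4,
        max_grad_norm=5.0,                   # gradient clipping at 5
    )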
Guide: Running Locally
To run the model locally, follow these steps:
- Install the Transformers library:

      pip install transformers

- Use the model in a Python script:

      from transformers import pipeline

      pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")
      result = pipe("{First paragraph} [SEP] {Second paragraph}")
      print(result)

- Interpret the output: a prediction of "1" indicates that the paragraphs are topically connected, while "0" suggests they are not. A short sketch of handling the raw pipeline output follows this list.
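The pipeline returns a list with one dictionary per input, containing a label and a confidence score. The label names used below (LABEL_0 / LABEL_1) are an assumption based on the default Transformers naming; check result[0]["label"] on your installation to confirm.

    # Sketch of interpreting the pipeline output; label names are assumed.
    from transformers import pipeline

    pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")
    result = pipe("{First paragraph} [SEP] {Second paragraph}")

    label = result[0]["label"]   # e.g. "LABEL_1"
    score = result[0]["score"]   # confidence of that label
    same_topic = label.endswith("1")
    print(f"Topically connected: {same_topic} (score: {score:.3f})")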
For enhanced performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
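If a GPU is available (locally or in the cloud), the pipeline can be placed on it via the standard device argument; this is a minimal sketch, not a required configuration.

    # Sketch: run the pipeline on the first CUDA GPU (device=0);
    # omit the device argument to fall back to CPU.
    from transformers import pipeline

    pipe = pipeline(
        "text-classification",
        model="dennlinger/bert-wiki-paragraphs",
        device=0,  # index of the GPU to use
    )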
License
The BERT-WIKI-PARAGRAPHS model is licensed under the MIT License, allowing for broad use and modification.