roberta-cls-consec
dennlinger
Introduction
The roberta-cls-consec model by Dennlinger is a fine-tuned RoBERTa-base model designed for topical change detection in documents. It predicts whether two text segments belong to the same topical section. The model was trained and evaluated as described in the paper "Topical Change Detection in Documents via Embeddings of Long Sequences" (arXiv:2012.03619).
Architecture
This model is based on the RoBERTa-base architecture, a 12-layer transformer encoder with roughly 125 million parameters, fine-tuned here for binary classification of segment pairs.
Training
The model was trained on a Terms of Service dataset to decide whether two text segments (up to 512 tokens) belong to the same topical section. Predicting this for pairs of segments makes it possible to split a document into topical sections. The model labels a pair of segments as either belonging to the same topic (LABEL_1) or not (LABEL_0).
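The label convention described above can be made explicit with two small helpers. This is a minimal sketch, assuming each prediction is a dictionary with "label" and "score" keys (the shape returned by the transformers text-classification pipeline) and that the score is a probability over exactly two labels. The function names `is_same_section` and `same_section_probability` are hypothetical and not part of the model's own code.

```python
def is_same_section(prediction: dict) -> bool:
    """Return True if the prediction says both segments share a topical section."""
    # Assumed convention from the text: LABEL_1 = same topic, LABEL_0 = different topic.
    return prediction["label"] == "LABEL_1"


def same_section_probability(prediction: dict) -> float:
    """Convert a (label, score) prediction into an estimate of P(same section).

    Assumes the score belongs to the predicted label and that the two
    label probabilities sum to 1.
    """
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score
```

Working with a single probability rather than a hard label makes it easier to experiment with a custom decision threshold for section boundaries.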
Guide: Running Locally
- Installation: Ensure you have the transformers library installed:
  pip install transformers
- Load the Model:
  from transformers import pipeline
  pipe = pipeline("text-classification", model="dennlinger/roberta-cls-consec")
- Inference: Use the model by providing two text segments separated by [SEP]:
  result = pipe("{First paragraph} [SEP] {Second paragraph}")
- Environment: For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure, which offer GPU instances.
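The steps above classify a single pair of segments. To segment a whole document, one option is to compare each paragraph with the one that follows it and start a new section whenever the model reports a topic change. The following is a minimal sketch of that idea, not the procedure from the paper. The names `build_pair` and `segment_document` are hypothetical, and `classify` stands for any callable that maps a joined pair to a dictionary with a "label" key.

```python
from typing import Callable, List


def build_pair(first: str, second: str) -> str:
    """Join two segments with the [SEP] marker used in the inference step."""
    return f"{first} [SEP] {second}"


def segment_document(
    paragraphs: List[str],
    classify: Callable[[str], dict],
) -> List[List[str]]:
    """Group consecutive paragraphs into topical sections.

    `classify` receives one joined pair and returns a dict with a "label" key,
    where LABEL_1 is assumed to mean "same topic" and LABEL_0 "different topic".
    """
    if not paragraphs:
        return []

    sections = [[paragraphs[0]]]
    # Walk over consecutive paragraph pairs: (p0, p1), (p1, p2), ...
    for previous, current in zip(paragraphs, paragraphs[1:]):
        prediction = classify(build_pair(previous, current))
        if prediction["label"] == "LABEL_1":
            sections[-1].append(current)  # same topic: extend the open section
        else:
            sections.append([current])    # topic change: start a new section
    return sections
```

With the pipeline loaded as in the guide, a wrapper such as `lambda text: pipe(text)[0]` could serve as `classify`, since the pipeline returns a list with one prediction per input string. Passing `device=0` when creating the pipeline places the model on the first GPU, if one is available.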
License
The model and its associated code are released under the MIT License, allowing for wide usage and modification with proper attribution.