Docling Models

ds4sd

Introduction

The Docling Models power Docling, a PDF document conversion package. They analyze page images to detect document layout and to recover table structure.

Architecture

Layout Model

The layout model uses the RT-DETR object detection architecture to locate and classify layout components in a document image. It detects eleven component types: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. Its performance is benchmarked on the DocLayNet dataset against both human evaluation and standard object detection methods.
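
As a rough illustration of how downstream code might consume layout predictions, the sketch below filters detections by confidence and groups them by class. The `Detection` type, the `group_by_label` helper, and the 0.5 threshold are hypothetical and not part of Docling; only the eleven label names come from the list of detected components.

```python
from collections import defaultdict
from dataclasses import dataclass

# The eleven layout classes the layout model detects.
LAYOUT_LABELS = (
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
)


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one predicted layout component."""
    label: str
    score: float  # confidence in [0, 1]
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def group_by_label(detections, min_score=0.5):
    """Drop low-confidence or unknown-label detections, group the rest by label."""
    grouped = defaultdict(list)
    for det in detections:
        if det.label in LAYOUT_LABELS and det.score >= min_score:
            grouped[det.label].append(det)
    return dict(grouped)


sample = [
    Detection("Title", 0.97, (50, 40, 550, 90)),
    Detection("Table", 0.91, (60, 300, 540, 620)),
    Detection("Text", 0.32, (60, 120, 540, 280)),  # below threshold, dropped
]
grouped = group_by_label(sample)
print(sorted(grouped))
```

The threshold is a tuning knob in this sketch: raising it trades recall for precision, and a suitable value depends on the documents being processed.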

TableFormer

TableFormer identifies table structure from images, taking the table regions predicted by the layout model as its input. It achieves state-of-the-art performance in table structure identification, outperforming methods such as Tabula, Traprange, Camelot, Acrobat Pro, and EDD.
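
The hand-off from layout detection to table structure recognition can be pictured as a cropping step. The following is a minimal sketch under stated assumptions: the `(label, box)` tuples, the `table_crop_boxes` helper, and the 4-pixel padding are all hypothetical and do not reflect Docling's actual internals.

```python
def table_crop_boxes(detections, page_width, page_height, padding=4):
    """Return padded, page-clamped crop boxes for every detection labelled "Table".

    `detections` is an iterable of (label, (x0, y0, x1, y1)) tuples in pixels.
    """
    crops = []
    for label, (x0, y0, x1, y1) in detections:
        if label != "Table":
            continue
        crops.append((
            max(0, x0 - padding),            # clamp to the left page edge
            max(0, y0 - padding),            # clamp to the top page edge
            min(page_width, x1 + padding),   # clamp to the right page edge
            min(page_height, y1 + padding),  # clamp to the bottom page edge
        ))
    return crops


page_detections = [
    ("Text", (60, 120, 540, 280)),
    ("Table", (2, 300, 598, 620)),
]
crops = table_crop_boxes(page_detections, page_width=600, page_height=800)
print(crops)
```

Each resulting box could then be used to cut the table image out of the page before passing it to a table structure model. The small padding is there so that border lines at the edge of a detection are less likely to be cut off.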

Training

The layout model and TableFormer are trained on datasets such as DocLayNet, which contains human-annotated document layouts. The models are evaluated on whether they match or exceed both human performance and other standard methods.
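
The text does not name the exact evaluation metric. Object detection benchmarks are commonly scored with metrics built on intersection over union (IoU) between a predicted box and an annotated box, so the sketch below shows that building block only. The function name and the example boxes are illustrative assumptions, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
annotated = (5, 0, 15, 10)
print(round(iou(predicted, annotated), 3))
```

A prediction is typically counted as correct when its IoU with an annotation of the same class exceeds a chosen threshold; aggregate scores are then computed over many such matches.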

Guide: Running Locally

  1. Clone the Repository: Start by cloning the Docling repository from GitHub [https://github.com/DS4SD/docling].
  2. Set Up Environment: Create a virtual environment and install the required dependencies into it.
  3. Download Model Weights: Obtain the pre-trained model weights if necessary.
  4. Run Inference: Use the provided scripts to run inference on sample document images.
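
The first two steps above can be sketched as a small script that builds the commands and, by default, only prints them. This is an assumption-laden outline: the `docling` checkout directory, the `.venv` location, and the editable `pip install -e` step (which presumes the repository root is pip-installable) are illustrative choices. Steps 3 and 4 are omitted because the text does not name the weight files or the inference scripts.

```python
import shlex
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/DS4SD/docling"


def setup_commands(workdir="docling", venv_dir=".venv"):
    """Build the clone, venv, and install commands as argument lists."""
    # Virtual environments keep executables in Scripts/ on Windows, bin/ elsewhere.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    venv_python = str(Path(venv_dir) / bin_dir / "python")
    return [
        ["git", "clone", REPO_URL, workdir],
        [sys.executable, "-m", "venv", venv_dir],
        # Assumption: the cloned repository can be installed in editable mode.
        [venv_python, "-m", "pip", "install", "-e", workdir],
    ]


def run_setup(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in setup_commands():
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_setup(dry_run=True)
```

Running with `dry_run=True` makes it possible to review the commands before anything is cloned or installed; calling `run_setup(dry_run=False)` would execute them in order and stop at the first failure.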

Cloud GPUs: For resource-intensive tasks, consider using cloud GPU services like AWS, Google Cloud, or Azure to accelerate processing.

License

The Docling Models are released under the CDLA-Permissive-2.0 license, allowing for free use, modification, and distribution with minimal restrictions.
