detr layout detection

cmarkea

Introduction

The detr-layout-detection model extracts layout elements such as Text, Picture, Caption, and Footnote from document images. It is a version of detr-resnet-50 fine-tuned on the DocLayNet dataset. Because it predicts masks and bounding boxes jointly, it is suitable for preparing documents for open-domain question answering (ODQA) systems.

Architecture

This model is based on the DETR (DEtection TRansformer) architecture, which applies a transformer encoder-decoder to object detection and, with an added mask head, to image segmentation. It identifies 11 layout entities, including Caption, Footnote, Formula, and List-item.
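
As a small illustration of working with these entities, the sketch below lists the 11 DocLayNet category names and counts how often each appears among predicted labels. The alphabetical order and the `count_entities` helper are assumptions made for this example; the index-to-name mapping that the model actually uses should be read from `model.config.id2label`.

```python
from collections import Counter

# The 11 DocLayNet layout categories, listed alphabetically.
# This order is illustrative only; it is not the model's class index order.
DOCLAYNET_LABELS = (
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
)


def count_entities(predicted_labels):
    """Count predicted label names per category (hypothetical helper).

    Categories that were not predicted are reported with a count of 0.
    """
    unknown = set(predicted_labels) - set(DOCLAYNET_LABELS)
    if unknown:
        raise ValueError(f"Unexpected label(s): {sorted(unknown)}")
    counts = Counter(predicted_labels)
    return {name: counts.get(name, 0) for name in DOCLAYNET_LABELS}
```

A per-page summary like this can help decide which regions to pass on, for example keeping Text and Table regions for a question answering pipeline.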

Training

The model is fine-tuned on the DocLayNet dataset, which contains diverse document layouts. Performance evaluation is conducted on 500 pages from this dataset, measuring both semantic segmentation and object detection capabilities using F1-score, Generalized Intersection over Union (GIoU), and accuracy metrics.

Performance Metrics:

  • Semantic Segmentation: Utilizes the F1-score for pixel classification.
  • Object Detection: Evaluated using GIoU and bounding box class accuracy.
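
To make the two metric families concrete, here is a minimal sketch of a pixel-level F1-score and of GIoU for a single pair of boxes. The function names, the nested-list mask format, and the (x_min, y_min, x_max, y_max) box convention are assumptions for this example; the source does not say how scores are aggregated across classes or pages, so this is not the evaluation script behind the reported numbers.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pixel_f1(pred_mask, true_mask) -> float:
    """F1 over two same-shaped binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # pixel predicted and present
            elif p:
                fp += 1      # pixel predicted but absent
            elif t:
                fn += 1      # pixel present but missed
    return f1_score(tp, fp, fn)


def giou(box_a, box_b) -> float:
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes.

    Both boxes are assumed to have positive area. The result lies in
    (-1, 1]: 1 for identical boxes, negative for boxes that are far apart.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))

    return inter / union - (hull - union) / hull
```

Unlike plain IoU, GIoU still distinguishes between non-overlapping boxes: the farther apart they are, the larger the enclosing box and the more negative the score.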

Benchmark Results:

  • cmarkea/detr-layout-detection: F1-score of 91.27, GIoU of 80.66, and accuracy of 90.46.
  • Compared with cmarkea/dit-base-layout-detection: Slightly lower performance.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Packages:

    pip install transformers torch pillow timm
    
  2. Load the Model:

    from transformers import AutoImageProcessor
    from transformers.models.detr import DetrForSegmentation
    
    img_proc = AutoImageProcessor.from_pretrained("cmarkea/detr-layout-detection")
    model = DetrForSegmentation.from_pretrained("cmarkea/detr-layout-detection")
    
  3. Prepare and Process an Image:

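    import torch
    from PIL import Image

    img = Image.open("page.png").convert("RGB")  # placeholder path to a document page image
    with torch.inference_mode():
        inputs = img_proc(img, return_tensors="pt")
        output = model(**inputs)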
    
  4. Post-Process Results:

    threshold = 0.4
    target_sizes = [img.size[::-1]]  # PIL gives (width, height); the processor expects (height, width)
    segmentation_mask = img_proc.post_process_segmentation(output, threshold=threshold, target_sizes=target_sizes)
    bbox_pred = img_proc.post_process_object_detection(output, threshold=threshold, target_sizes=target_sizes)
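
The object detection output is a list with one dictionary per image, holding "scores", "labels", and "boxes". The sketch below turns one such dictionary into plain Python records with readable label names. The `to_records` helper and its top-to-bottom sort are assumptions for illustration, not part of the transformers API; the sort is only a rough approximation of reading order.

```python
def to_records(detection, id2label):
    """Convert one post-processed detection dict into a list of plain dicts.

    `detection` needs "scores", "labels", and "boxes" entries, given either
    as tensors (anything with .tolist()) or as plain Python lists.
    `id2label` maps class indices to names, e.g. model.config.id2label.
    """
    def as_list(values):
        return values.tolist() if hasattr(values, "tolist") else list(values)

    records = []
    for score, label, box in zip(
        as_list(detection["scores"]),
        as_list(detection["labels"]),
        as_list(detection["boxes"]),
    ):
        records.append({
            "label": id2label.get(int(label), str(label)),
            "score": round(float(score), 3),
            # Boxes are (x_min, y_min, x_max, y_max) in pixels.
            "box": [round(float(v), 1) for v in box],
        })

    # Order by top edge, then left edge.
    records.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return records
```

With the variables from the steps above, a call would look like `to_records(bbox_pred[0], model.config.id2label)`.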
    

Cloud GPU Recommendation: Consider using cloud-based services such as AWS EC2 with GPU instances, or Google Cloud's AI Platform for enhanced performance, especially with large datasets.

License

This project is licensed under the Apache-2.0 License, allowing for both personal and commercial use, distribution, and modification, subject to the license terms.
