detr layout detection
cmarkea
Introduction
The detr-layout-detection model extracts layout elements such as Text, Picture, Caption, and Footnote from document images. It is a version of the detr-resnet-50 model fine-tuned on the DocLayNet dataset. It jointly predicts segmentation masks and bounding boxes, which makes it suitable for preprocessing documents for open-domain question answering (ODQA) systems.
Architecture
This model is based on the DETR (DEtection TRansformer) architecture, a transformer-based object detector; the segmentation variant used here adds a mask head on top of the detector. It identifies the 11 DocLayNet layout classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
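To make the detection step concrete, here is a minimal sketch of how DETR-style class logits are usually turned into detections: each object query gets a softmax over the layout classes plus one trailing "no object" slot, and only queries whose best real class clears a threshold are kept. The function name, the two-class label map, and the logit values are illustrative assumptions, not taken from the model.

```python
import math


def decode_queries(class_logits, id2label, threshold=0.4):
    """Turn per-query class logits into (query_index, label, score) tuples.

    The last logit of each query is assumed to be the "no object" class,
    as in DETR; it takes part in the softmax but is never returned.
    """
    detections = []
    for query_index, logits in enumerate(class_logits):
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        object_probs = probs[:-1]  # drop the "no object" slot
        best = max(range(len(object_probs)), key=object_probs.__getitem__)
        if object_probs[best] > threshold:
            detections.append((query_index, id2label[best], object_probs[best]))
    return detections


# Hypothetical two-class label map and three queries, for illustration only.
toy_id2label = {0: "Text", 1: "Picture"}
toy_logits = [
    [4.0, 0.0, 0.0],  # confident "Text"
    [0.0, 0.0, 5.0],  # mostly "no object": discarded
    [0.0, 3.0, 0.5],  # confident "Picture"
]
toy_detections = decode_queries(toy_logits, toy_id2label)
```

In practice the image processor's post-processing methods perform this filtering, so the sketch is only meant to show what the threshold acts on.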
Training
The model is fine-tuned on the DocLayNet dataset, which contains diverse document layouts. Performance evaluation is conducted on 500 pages from this dataset, measuring both semantic segmentation and object detection capabilities using F1-score, Generalized Intersection over Union (GIoU), and accuracy metrics.
Performance Metrics:
- Semantic Segmentation: Utilizes the F1-score for pixel classification.
- Object Detection: Evaluated using GIoU and bounding box class accuracy.
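As a rough illustration of the two metrics above, the sketch below computes GIoU for a pair of boxes and a per-class pixel F1-score over flattened label maps. It assumes boxes in (x_min, y_min, x_max, y_max) form with positive area; how the reported scores were aggregated across classes and pages is not stated here, so `giou` and `pixel_f1` are hypothetical helpers rather than the evaluation code behind the published numbers.

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes, in [-1, 1]."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    # Plain IoU, minus the share of the enclosing box covered by neither input.
    return inter / union - (hull - union) / hull


def pixel_f1(predicted, target, label):
    """F1-score of one class over two equally sized, flattened label maps."""
    pairs = list(zip(predicted, target))
    tp = sum(p == label and t == label for p, t in pairs)
    fp = sum(p == label and t != label for p, t in pairs)
    fn = sum(p != label and t == label for p, t in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Unlike plain IoU, GIoU stays informative when boxes do not overlap: identical boxes score 1, while disjoint boxes score below 0, and more so the farther apart they are.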
Benchmark Results:
- cmarkea/detr-layout-detection: F1-score of 91.27, GIoU of 80.66, and accuracy of 90.46.
- cmarkea/dit-base-layout-detection (for comparison): slightly lower performance.
Guide: Running Locally
To run the model locally, follow these steps:
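- Install Required Packages:
    pip install transformers torch pillow
  The steps below also import torch and Pillow. Depending on the transformers version, the DETR ResNet backbone may additionally require timm.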
- Load the Model:
    from transformers import AutoImageProcessor
    from transformers.models.detr import DetrForSegmentation

    img_proc = AutoImageProcessor.from_pretrained("cmarkea/detr-layout-detection")
    model = DetrForSegmentation.from_pretrained("cmarkea/detr-layout-detection")
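- Prepare and Process an Image:
    import torch
    from PIL import Image

    # "page.png" is a placeholder path; point it at your own document image.
    img = Image.open("page.png").convert("RGB")

    with torch.inference_mode():
        input_ids = img_proc(img, return_tensors='pt')  # pixel tensors, despite the name
        output = model(**input_ids)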
- Post-Process Results:
    threshold = 0.4
    # img.size is (width, height); target_sizes expects (height, width).
    target_sizes = [img.size[::-1]]
    segmentation_mask = img_proc.post_process_segmentation(output, threshold=threshold, target_sizes=target_sizes)
    bbox_pred = img_proc.post_process_object_detection(output, threshold=threshold, target_sizes=target_sizes)
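Once the steps above have run, `bbox_pred` is a list with one dictionary per image, holding `scores`, `labels`, and `boxes`. The following sketch shows one way to turn such a dictionary into readable records; `summarize_detections` is a hypothetical helper, and the sample values and two-class label map are stand-ins rather than real model output.

```python
def summarize_detections(detections, id2label):
    """Convert one image's detection dict into sorted, readable records."""

    def as_list(values):
        # Tensors expose .tolist(); plain lists pass through unchanged.
        return values.tolist() if hasattr(values, "tolist") else list(values)

    records = []
    for score, label, box in zip(
        as_list(detections["scores"]),
        as_list(detections["labels"]),
        as_list(detections["boxes"]),
    ):
        records.append({
            "label": id2label[int(label)],
            "score": round(float(score), 3),
            # Boxes are (x_min, y_min, x_max, y_max) in pixels.
            "box": [round(float(v), 1) for v in box],
        })
    # Highest-confidence regions first.
    return sorted(records, key=lambda r: r["score"], reverse=True)


# Stand-in values with the same keys as one entry of bbox_pred.
example = {
    "scores": [0.71, 0.93],
    "labels": [1, 0],
    "boxes": [[10.0, 20.0, 110.0, 220.0], [5.0, 5.0, 300.0, 40.0]],
}
example_id2label = {0: "Text", 1: "Picture"}  # hypothetical mapping
example_records = summarize_detections(example, example_id2label)
```

With the real model, the call would look like `summarize_detections(bbox_pred[0], model.config.id2label)`, where `model.config.id2label` supplies the class names stored with the checkpoint.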
Cloud GPU Recommendation: For faster inference, especially over large document collections, consider cloud GPU services such as AWS EC2 GPU instances or Google Cloud's AI Platform.
License
This project is licensed under the Apache-2.0 License, allowing for both personal and commercial use, distribution, and modification, subject to the license terms.