Introduction

MinerU is a model that converts PDF documents to Markdown and supports both Chinese and English. It handles text layout analysis, mathematical formula recognition, and table structure reconstruction.

Architecture

MinerU uses a multi-model architecture in which a dedicated model handles each task:

  • Layout: Document layout analysis using Detectron2.
  • MFD: Mathematical formula detection with a custom CNN.
  • MFR: Mathematical formula recognition using a BERT-based model.
  • TabRec: Table recognition and reconstruction with a T5-based approach.
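
To illustrate how a multi-model design like this can fit together, here is a minimal sketch in which layout regions are routed to a per-type handler and the results are joined into Markdown. Everything in it is an assumption made for illustration: the Region class, the handler functions, and the payload format are hypothetical, and the handlers are plain-string stand-ins for the Detectron2, CNN, BERT-based, and T5-based models listed above. It does not reflect MinerU's actual internals.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One block found by the layout stage (hypothetical structure)."""
    kind: str     # "text", "formula", or "table"
    payload: str  # stand-in for the cropped page content


def render_text(region: Region) -> str:
    # Plain text passes through unchanged.
    return region.payload


def render_formula(region: Region) -> str:
    # Stand-in for MFD + MFR: wrap the recognized LaTeX in display-math delimiters.
    return f"$${region.payload}$$"


def render_table(region: Region) -> str:
    # Stand-in for TabRec: rows are ";"-separated, cells are ","-separated.
    rows = [row.split(",") for row in region.payload.split(";")]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Each region kind is dispatched to the handler for its task.
HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": render_text,
    "formula": render_formula,
    "table": render_table,
}


def regions_to_markdown(regions: List[Region]) -> str:
    """Render regions in reading order, separated by blank lines."""
    return "\n\n".join(HANDLERS[region.kind](region) for region in regions)


if __name__ == "__main__":
    page = [
        Region("text", "Intro"),
        Region("formula", "E = mc^2"),
        Region("table", "a,b;1,2"),
    ]
    print(regions_to_markdown(page))
```

Keeping the dispatch table separate from the handlers means one stage can be swapped for another without touching the rest of the flow, which is the usual motivation for splitting the work across task-specific models.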

Training

The model was trained using datasets of academic papers, textbooks, and technical documents. The training process involved:

  1. Pre-training of individual sub-models.
  2. Joint training for optimization.
  3. End-to-end fine-tuning.

Evaluation results indicate:

  • Text recognition accuracy: 95%
  • Formula recognition accuracy: 90%
  • Table reconstruction accuracy: 85%

Guide: Running Locally

To run MinerU locally, follow these steps:

  1. Set up the environment with the required hardware and software:
    • Hardware Requirements:
      • RAM: 16GB+
      • GPU: 8GB+ VRAM (cloud services such as AWS or Google Cloud can provide GPU access)
      • Storage: 5GB
    • Software Requirements:
      • Python >= 3.7
      • PyTorch >= 1.9.0
      • transformers library >= 4.28.0
      • detectron2
  2. Use the following Python code snippet to convert a PDF to Markdown:
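    from transformers import pipeline

    # The task name and model ID are taken as given. "document-conversion" does
    # not appear among the standard transformers pipeline tasks, so this call
    # may fail unless the model repository registers a custom pipeline.
    converter = pipeline("document-conversion", model="kitjesen/MinerU")
    # Pass the path of the PDF; the result is expected to hold the Markdown.
    markdown = converter("document.pdf")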
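
Before running the snippet, it can help to confirm that the software requirements from step 1 are met. The following is a minimal sketch of such a check, not part of MinerU itself: check_environment and parse_version are hypothetical helper names, the distribution names (torch, transformers, detectron2) are assumed to match how the packages were installed, and importlib.metadata requires Python 3.8 or newer.

```python
import sys
from importlib import metadata  # standard library from Python 3.8

# Minimum versions copied from the requirements list above.
# detectron2 has no stated minimum, so only its presence is checked.
REQUIREMENTS = {
    "torch": (1, 9, 0),
    "transformers": (4, 28, 0),
    "detectron2": None,
}
MIN_PYTHON = (3, 7)


def parse_version(text):
    """Turn a string such as '2.1.0+cu118' into (2, 1, 0).

    Local suffixes and non-numeric parts are ignored; short versions
    such as '1.9' are padded with zeros to three components.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts + [0] * (3 - len(parts)))


def check_environment(requirements=REQUIREMENTS, min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted} or newer is required")
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if minimum is not None and parse_version(installed) < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name} {installed} is older than required {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The check reads installed package metadata only, so it does not import PyTorch or touch the GPU; RAM, VRAM, and storage still need to be verified separately.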
    

License

MinerU is licensed under the Apache-2.0 License, which permits both personal and commercial use.
