Introduction

MAGIV2 is a model for chapter-wide manga transcription with character names. The project is led by Ragav Sachdeva, Gyungin Shin, and Andrew Zisserman of the University of Oxford. It supports tasks such as object detection and optical character recognition (OCR) in the context of manga.

Architecture

The model is built with PyTorch and performs object detection, OCR, character clustering, and speaker diarisation. It processes the pages of a manga chapter jointly, associating each text box with the character who speaks it, which enables a complete, named transcript of the chapter.
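
Concretely, the transcription loop later in this guide reads a handful of fields from each per-page result. The sketch below illustrates their rough shape with hypothetical values; it is inferred from the guide's own code, not an exhaustive schema.

```python
# Illustrative per-page result, with made-up values; only the fields
# used by the transcription loop in this guide are shown.
page_result = {
    "ocr": ["It's time.", "Let's go!"],              # recognised text per text box
    "is_essential_text": [True, True],               # dialogue vs. decorative text
    "character_names": ["Hero", "Rival"],            # names resolved via the character bank
    "text_character_associations": [[0, 0], [1, 1]], # (text_idx, char_idx) speaker links
}

# Resolving speakers, as the transcription loop does:
speaker = {text_idx: page_result["character_names"][char_idx]
           for text_idx, char_idx in page_result["text_character_associations"]}
```

With these values, speaker maps text box 0 to "Hero" and text box 1 to "Rival".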

Training

The details of the training process are not specified in the documentation. The model is available on the Hugging Face Hub, where it can be loaded and used directly for predictions.

Guide: Running Locally

  1. Prerequisites: Ensure you have Python installed, along with the required libraries: transformers, torch, Pillow (imported as PIL), and numpy.
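
  Assuming a standard pip setup, the dependencies above can be installed as follows (package names inferred from the imports used in this guide):

```shell
# Install the libraries used in the steps below.
pip install transformers torch pillow numpy
```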

  2. Loading the Model:

    from transformers import AutoModel
    
    # trust_remote_code is required because MAGIV2 ships custom modelling code;
    # .cuda() assumes a CUDA-capable GPU is available.
    model = AutoModel.from_pretrained("ragavsachdeva/magiv2", trust_remote_code=True).cuda().eval()
    
  3. Preparing Images: Load your manga pages and character reference images and convert them to RGB format for processing.

    from PIL import Image
    import numpy as np
    
    def read_image(path_to_image):
        # Load as greyscale, then expand back to 3-channel RGB,
        # matching the input format the model expects.
        with open(path_to_image, "rb") as file:
            image = Image.open(file).convert("L").convert("RGB")
            image = np.array(image)
        return image
    
    chapter_pages = [read_image(x) for x in ["page1.png", "page2.png", "page3.png"]]
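
  Step 4 below also expects a character_bank: reference images of the characters paired with the names the model should assign. Its exact structure is an assumption based on the model's example usage, and the file names here are placeholders; this sketch writes tiny blank placeholder images so it runs end-to-end.

```python
import numpy as np
from PIL import Image

# Write two tiny blank placeholder portraits so this sketch is runnable;
# in practice, point these paths at real character reference crops.
for path in ("char_a.png", "char_b.png"):
    Image.new("RGB", (64, 64), "white").save(path)

def load_rgb(path):
    # Same greyscale-then-RGB loading as read_image above.
    with open(path, "rb") as f:
        return np.array(Image.open(f).convert("L").convert("RGB"))

# Assumed structure: parallel lists of reference images and names.
character_bank = {
    "images": [load_rgb(p) for p in ["char_a.png", "char_b.png"]],
    "names": ["Character A", "Character B"],
}
```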
    
  4. Performing Predictions: Pass the chapter pages and the character bank to the model to run chapter-wide prediction and generate a transcript.

    import torch
    
    with torch.no_grad():
        per_page_results = model.do_chapter_wide_prediction(chapter_pages, character_bank, use_tqdm=True, do_ocr=True)
    
    transcript = []
    for i, (image, page_result) in enumerate(zip(chapter_pages, per_page_results)):
        # Save an annotated copy of the page with the predictions drawn on it.
        model.visualise_single_image_prediction(image, page_result, f"page_{i}.png")
        # Map each text box index to the name of the character speaking it.
        speaker_name = {
            text_idx: page_result["character_names"][char_idx] for text_idx, char_idx in page_result["text_character_associations"]
        }
        for j in range(len(page_result["ocr"])):
            # Skip non-essential text such as sound effects.
            if not page_result["is_essential_text"][j]:
                continue
            # Fall back to "unsure" when no speaker was associated with the text.
            name = speaker_name.get(j, "unsure")
            transcript.append(f"<{name}>: {page_result['ocr'][j]}")
    
  5. Saving Transcripts:

    with open("transcript.txt", "w") as fh:
        for line in transcript:
            fh.write(line + "\n")
    

Cloud GPUs: For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.

License

The model and datasets are available for personal, research, non-commercial, and not-for-profit use. For commercial purposes or other uses, contact the author through the provided website for licensing arrangements.
