magiv2

Introduction
MAGIV2 is a model designed for chapter-wide manga transcriptions with character names. The project is led by Ragav Sachdeva, Gyungin Shin, and Andrew Zisserman from the University of Oxford. It supports tasks such as object detection and optical character recognition (OCR) in the context of manga.
Architecture
The model is built using PyTorch and is capable of performing object detection, OCR, clustering, and diarisation. It processes manga pages to associate text with character names, allowing for a comprehensive transcription of manga chapters.
Training
The details of the training process are not specified in the documentation. The model is available in the Hugging Face model hub, where it can be loaded and used directly for predictions.
Guide: Running Locally
- Prerequisites: Ensure you have Python installed, along with the required libraries such as `transformers`, `torch`, `numpy`, and `Pillow` (imported as `PIL`).
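Assuming a standard pip-based environment (exact package versions are not pinned in the documentation), the dependencies above can be installed with:

```shell
pip install transformers torch numpy pillow
```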
- Loading the Model:

  ```python
  from transformers import AutoModel

  model = AutoModel.from_pretrained("ragavsachdeva/magiv2", trust_remote_code=True).cuda().eval()
  ```
- Preparing Images: Convert your manga pages and character images to RGB format for processing.

  ```python
  from PIL import Image
  import numpy as np

  def read_image(path_to_image):
      with open(path_to_image, "rb") as file:
          image = Image.open(file).convert("L").convert("RGB")
          image = np.array(image)
      return image

  chapter_pages = [read_image(x) for x in ["page1.png", "page2.png", "page3.png"]]
  ```
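The prediction step also expects a `character_bank` of reference character images and names. A hedged sketch of how it might be assembled, assuming the bank is a dict with `"images"` and `"names"` keys (the file names below are placeholders, not files shipped with the model):

```python
import numpy as np
from PIL import Image

def read_image(path_to_image):
    # Same grayscale -> RGB loading as the helper above.
    with open(path_to_image, "rb") as file:
        return np.array(Image.open(file).convert("L").convert("RGB"))

def build_character_bank(image_paths, names):
    # One reference crop per character, paired with that character's name.
    assert len(image_paths) == len(names), "one name per reference image"
    return {
        "images": [read_image(p) for p in image_paths],
        "names": list(names),
    }

# e.g. character_bank = build_character_bank(["luffy.png", "zoro.png"], ["Luffy", "Zoro"])
```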
- Performing Predictions: Use the model to perform chapter-wide predictions and generate transcripts.

  ```python
  import torch

  with torch.no_grad():
      per_page_results = model.do_chapter_wide_prediction(chapter_pages, character_bank, use_tqdm=True, do_ocr=True)

  transcript = []
  for i, (image, page_result) in enumerate(zip(chapter_pages, per_page_results)):
      model.visualise_single_image_prediction(image, page_result, f"page_{i}.png")
      speaker_name = {
          text_idx: page_result["character_names"][char_idx]
          for text_idx, char_idx in page_result["text_character_associations"]
      }
      for j in range(len(page_result["ocr"])):
          if not page_result["is_essential_text"][j]:
              continue
          name = speaker_name.get(j, "unsure")
          transcript.append(f"<{name}>: {page_result['ocr'][j]}")
  ```
- Saving Transcripts:

  ```python
  with open("transcript.txt", "w") as fh:
      for line in transcript:
          fh.write(line + "\n")
  ```
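The speaker-attribution logic in the prediction step can be exercised without running the model. A minimal sketch using a mocked `page_result` (the field values below are illustrative, not real model output):

```python
# Mocked single-page output with the fields used in the prediction step.
page_result = {
    "ocr": ["Let's go!", "CHAPTER 12"],
    "is_essential_text": [True, False],   # non-essential text (e.g. titles) is skipped
    "character_names": ["Luffy", "Zoro"],
    "text_character_associations": [(0, 0)],  # text box 0 is spoken by character 0
}

# Map each text box index to its speaker's name.
speaker_name = {
    text_idx: page_result["character_names"][char_idx]
    for text_idx, char_idx in page_result["text_character_associations"]
}

# Keep essential text only; unattributed lines fall back to "unsure".
transcript = [
    f"<{speaker_name.get(j, 'unsure')}>: {page_result['ocr'][j]}"
    for j in range(len(page_result["ocr"]))
    if page_result["is_essential_text"][j]
]
```

Running this yields a one-line transcript attributed to the mocked speaker.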
Cloud GPUs: For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
License
The model and datasets are available for personal, research, non-commercial, and not-for-profit use. For commercial purposes or other uses, contact the author through the provided website for licensing arrangements.