Zing Forum

Multi-Modal Manga Translation Pipeline: An End-to-End Automatic Japanese Manga Translation System Combining CV, OCR, and Large Models

This project is an end-to-end machine learning pipeline that automates Japanese manga translation: YOLOv8 detects speech bubbles, MangaOCR extracts the Japanese text, a large language model served through Ollama translates it, and a custom typesetting engine renders the English text back into the bubbles.

Tags: manga translation, OCR, YOLOv8, large language models, multimodal, computer vision, Qwen, automation
Published 2026-05-08 06:10 | Recent activity 2026-05-08 10:14 | Estimated read 6 min

Section 02

Pain Points of Manga Translation: From Manual to Automated

Traditional manga translation is labor-intensive: someone has to locate every speech bubble, extract the Japanese text, translate it, and typeset the result by hand. A typical chapter may contain dozens of pages, each with several dialogue boxes, so the whole process takes hours or even days. For scanlation groups and small publishers, this severely limits output.

More importantly, translation quality depends not only on the accuracy of language conversion but also on maintaining character tone and narrative coherence. When multiple translators collaborate, consistency in terminology and character names is often difficult to ensure, which affects the reading experience.

Section 03

Project Overview: Fully Automated Translation Pipeline

Multi-Modal-Manga-Translation-Pipeline is an end-to-end machine learning pipeline that automates speech bubble detection, text extraction, translation, and typesetting for Japanese manga by combining computer vision, OCR, and large language models. The system can batch process entire chapters, carry narrative context across pages, and produce coherent translations.

The core innovation of the project lies in integrating multiple specialized AI components into a unified processing flow, where each component handles a specific task and works collaboratively to achieve high-quality automated translation.

Section 04

System Architecture: Four-Stage Processing Flow

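The pipeline adopts a modular four-stage architecture: speech bubble detection, text extraction, context-aware translation, and typesetting. Each stage is described in its own section.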
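To make the modular structure concrete, here is a minimal Python sketch of how four stages could be chained for one page. It is not the project's actual code: the names `Bubble`, `run_pipeline`, and the four stage callables are hypothetical, chosen only to illustrate that each stage can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Bubble:
    """One speech bubble and the text attached to it (hypothetical type)."""
    box: Box
    japanese: str = ""
    english: str = ""


def run_pipeline(
    page: object,
    detect: Callable[[object], List[Box]],
    ocr: Callable[[object, Box], str],
    translate: Callable[[List[str]], List[str]],
    typeset: Callable[[object, List[Bubble]], object],
) -> object:
    """Run one page through the four stages in order."""
    # Stage 1: locate speech bubbles.
    bubbles = [Bubble(box=b) for b in detect(page)]
    # Stage 2: read the Japanese text inside each bubble.
    for bubble in bubbles:
        bubble.japanese = ocr(page, bubble.box)
    # Stage 3: translate all bubble texts in one call so the model sees context.
    english = translate([b.japanese for b in bubbles])
    for bubble, text in zip(bubbles, english):
        bubble.english = text
    # Stage 4: draw the English text back into the bubbles.
    return typeset(page, bubbles)
```

Passing the stages in as callables keeps the sketch independent of any particular detector, OCR model, or LLM backend.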

Section 05

Stage 1: Speech Bubble Detection (YOLOv8)

A YOLOv8 model detects the positions of speech bubbles on each manga page. The model is trained specifically for manga layouts and can recognize speech bubbles of various shapes and sizes, including overlapping bubbles and other edge cases. The system also uses an adaptive confidence threshold: if no bubbles are detected, it automatically lowers the threshold and retries.
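The adaptive threshold can be sketched as a simple retry loop. The threshold values and the `detect` callable below are assumptions for illustration; the post does not state the actual values. With the ultralytics package, `detect` could plausibly wrap `model.predict(page, conf=conf)` and read the boxes from `results[0].boxes.xyxy`.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def detect_with_fallback(
    page: object,
    detect: Callable[[object, float], List[Box]],
    thresholds: Sequence[float] = (0.5, 0.35, 0.2, 0.1),  # assumed values
) -> Tuple[List[Box], float]:
    """Try each threshold from strictest to loosest; stop at the first hit."""
    if not thresholds:
        raise ValueError("at least one confidence threshold is required")
    for conf in thresholds:
        boxes = detect(page, conf)
        if boxes:
            return boxes, conf
    # No bubbles at any threshold: report an empty page instead of failing.
    return [], thresholds[-1]
```

Returning the threshold that succeeded makes it easy to log pages that needed a looser setting and review them for false positives.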

Section 06

Stage 2: Text Extraction (MangaOCR)

After detecting the bubbles, MangaOCR is used to extract the Japanese text from them. MangaOCR is an OCR model optimized specifically for Japanese manga, capable of handling manga-specific fonts, layouts, and background interference.
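As a rough illustration of this stage, the sketch below crops each detected bubble and hands the crop to an OCR callable. The padding step and the function names are assumptions, not part of the project. `page` is expected to behave like a Pillow image (a `size` attribute and a `crop(box)` method), and `ocr` could be a `manga_ocr.MangaOcr()` instance, which is callable on an image.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def padded_box(box: Box, page_size: Tuple[int, int], pad: int = 4) -> Box:
    """Grow a box by `pad` pixels on each side, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def extract_texts(
    page,
    boxes: List[Box],
    ocr: Callable[[object], str],
    pad: int = 4,  # assumed margin so glyphs at the bubble edge are not cut
) -> List[str]:
    """Crop every bubble from the page and run OCR on each crop."""
    texts = []
    for box in boxes:
        crop = page.crop(padded_box(box, page.size, pad))
        texts.append(ocr(crop).strip())
    return texts
```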

Section 07

Stage 3: Context-Aware Translation (Ollama + Qwen 2.5)

The extracted Japanese text is translated by the Qwen 2.5 large language model, served locally through Ollama. Unlike traditional machine translation, the system makes translation context-aware through the following mechanisms:

  • Batch Processing: Translate 3-4 pages at once to maintain dialogue coherence
  • Series Metadata Integration: Use title, genre, and description to adjust tone and terminology
  • Custom Translation Dictionary: Ensure consistency of character names and terminology throughout the chapter
  • Fallback Mechanism: Retry translation for failed content individually, and convert untranslatable Japanese text to Romaji
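The mechanisms above can be sketched roughly as follows. This is an illustrative outline, not the project's implementation: the prompt wording, the numbered-reply format, the `qwen2.5` model tag, and the `to_romaji` callable are all assumptions. The only external interface used is Ollama's `/api/generate` endpoint with `stream` set to false.

```python
import json
import re
import urllib.request
from typing import Callable, Dict, List

JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # kana and common kanji


def ollama_generate(prompt: str, model: str = "qwen2.5",
                    url: str = "http://localhost:11434/api/generate") -> str:
    """Send one non-streaming request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def batches(pages: List[List[str]], size: int = 3) -> List[List[str]]:
    """Flatten every `size` consecutive pages into one list of lines."""
    return [sum(pages[i:i + size], []) for i in range(0, len(pages), size)]


def build_prompt(lines: List[str], meta: Dict[str, str],
                 glossary: Dict[str, str]) -> str:
    """Combine series metadata, the glossary, and numbered source lines."""
    terms = "\n".join(f"{jp} = {en}" for jp, en in glossary.items())
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(lines, 1))
    return (
        f"Series: {meta.get('title', '')} ({meta.get('genre', '')})\n"
        f"Description: {meta.get('description', '')}\n"
        f"Always use these translations:\n{terms}\n"
        "Translate each numbered Japanese line into English. "
        "Reply with the same numbering, one line each.\n"
        f"{numbered}"
    )


def parse_numbered(reply: str) -> Dict[int, str]:
    """Read lines shaped like '3. text' into {3: 'text'}."""
    found = {}
    for line in reply.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if match:
            found[int(match.group(1))] = match.group(2).strip()
    return found


def translate_batch(lines: List[str], meta: Dict[str, str],
                    glossary: Dict[str, str],
                    generate: Callable[[str], str],
                    to_romaji: Callable[[str], str]) -> List[str]:
    """Translate a batch; retry failures one by one, then fall back to romaji."""
    parsed = parse_numbered(generate(build_prompt(lines, meta, glossary)))
    results = []
    for i, source in enumerate(lines, 1):
        text = parsed.get(i, "")
        if not text or JAPANESE.search(text):
            # Retry a line the batch call skipped or left untranslated.
            retry = parse_numbered(generate(build_prompt([source], meta, glossary)))
            text = retry.get(1, "")
        if not text or JAPANESE.search(text):
            # Last resort: romanize instead of leaving the bubble empty.
            text = to_romaji(source)
        results.append(text)
    return results
```

In use, `generate` would be `ollama_generate` and `to_romaji` would come from a romanization library of your choice; both are injected so the fallback logic can be exercised without a running model.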

Section 08

Stage 4: Intelligent Typesetting Engine

The translated English text is rendered back into the bubbles via a custom typesetting engine:

  • Dynamic Font Size: Automatically adjust text size based on bubble dimensions
  • Intelligent Text Wrapping: Use pyphen for hyphenation to avoid awkward line breaks
  • Gaussian Blur Cleaning: Create a semi-transparent effect instead of a rigid white block
  • Outlined Text: Ensure readability on different backgrounds
  • Font Caching: Optimize real-time processing performance
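Dynamic font sizing and wrapping can be sketched as a search for the largest size whose wrapped lines still fit the bubble. This is a simplified illustration under stated assumptions: the width estimate of about 0.6 em per character, the size range, and the line spacing are placeholders, and hyphenation is left out. With Pillow, `measure` could be replaced by `ImageFont.truetype(path, size).getlength(text)`, with the font loader wrapped in `functools.lru_cache` to get the caching effect.

```python
from typing import Callable, List, Optional, Tuple


def approx_width(text: str, size: int) -> float:
    """Rough width estimate: about 0.6 em per character (an assumption)."""
    return len(text) * size * 0.6


def wrap(text: str, size: int, max_width: float,
         measure: Callable[[str, int], float] = approx_width) -> Optional[List[str]]:
    """Greedy word wrap by pixel width; None if a single word cannot fit."""
    lines: List[str] = []
    current = ""
    for word in text.split():
        if measure(word, size) > max_width:
            return None
        candidate = f"{current} {word}".strip()
        if measure(candidate, size) <= max_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def fit_text(text: str, box_size: Tuple[int, int],
             max_size: int = 32, min_size: int = 8,
             line_spacing: float = 1.2,
             measure: Callable[[str, int], float] = approx_width,
             ) -> Tuple[int, List[str]]:
    """Return the largest font size, with its wrapped lines, that fits the box."""
    width, height = box_size
    for size in range(max_size, min_size - 1, -1):
        lines = wrap(text, size, width, measure)
        if lines is not None and len(lines) * size * line_spacing <= height:
            return size, lines
    # Nothing fits: fall back to the smallest size and accept overflow.
    return min_size, [text]
```

A linear scan from large to small is enough for the handful of sizes a bubble needs; a word that cannot fit at a given size is the point where a hyphenation step such as pyphen would split it instead of shrinking the font further.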