Section 01
[Introduction] OCRPolish: An LLM-Optimized OCR Post-Processing and Knowledge Base Toolkit
OCRPolish is an OCR post-processing toolkit written in Python and optimized for OCR output processed by LLMs. Its core features include cleaning up messily formatted OCR text, extracting metadata with local LLMs, and generating Obsidian index pages. The design goal is to turn raw OCR output into a structured knowledge base, bridging the gap between OCR output and usable knowledge. This makes it particularly well suited to Obsidian users, researchers, and archival digitization work.
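To make two of these features concrete, here is a minimal sketch of what text cleanup and index-page generation could look like. It is not OCRPolish's actual API: the function names `clean_ocr_text` and `build_index_page`, the specific cleanup rules, and the index layout are all assumptions chosen for illustration. The only detail borrowed from Obsidian itself is the `[[note name]]` wikilink syntax.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Apply a few generic cleanup rules to raw OCR text (hypothetical helper)."""
    text = raw.replace("\r\n", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Rejoin words split across a line break: "digi-\ntization" -> "digitization".
    # Limitation: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Merge single line breaks inside a paragraph into spaces,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def build_index_page(title: str, note_names: list[str]) -> str:
    """Render a Markdown index page linking to notes via Obsidian wikilinks."""
    lines = [f"# {title}", ""]
    lines += [f"- [[{name}]]" for name in sorted(note_names)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    raw = "Archival  digi-\ntization of\nold   records.\n\n\n\nSecond para-\ngraph."
    print(clean_ocr_text(raw))
    print(build_index_page("Scanned Letters", ["letter-02", "letter-01"]))
```

A real pipeline would likely need more careful rules (for example, a dictionary check before rejoining hyphenated words), and the metadata extraction step is omitted here because it depends on which local LLM runtime is used.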