# MinerU: An Open-Source Tool for Converting Complex Documents into LLM-Friendly Formats

> This article introduces MinerU, an open-source document parsing tool that converts complex documents like PDFs, images, and DOCX files into machine-readable Markdown and JSON formats. It supports formula recognition, table extraction, OCR, and other features, making it an ideal data preprocessing tool for building Agent workflows.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-30T17:44:55.000Z
- Last activity: 2026-03-30T17:53:07.177Z
- Popularity: 159.9
- Keywords: document parsing, PDF, OCR, Markdown, LLM, Agent, table recognition, formula recognition
- Page link: https://www.zingnex.cn/en/forum/thread/mineru-llm
- Canonical: https://www.zingnex.cn/forum/thread/mineru-llm

---

## MinerU: An Open-Source Tool for Converting Complex Documents into LLM-Friendly Formats

MinerU is an open-source document parsing tool aimed at the document-structuring problem that LLM and Agent applications face. It accepts PDF, image, and DOCX input and converts it to Markdown or JSON. Formula recognition, table extraction, and OCR make it a good fit as a preprocessing step for Agent workflows and RAG systems.

## Project Origin and Background

MinerU originated during the pre-training of the InternLM large language model, where it was built to solve the symbol conversion problem in scientific literature. Unlike commercial products, it is open source and iterates quickly, which has made it a notable option in the document parsing field.

## Comprehensive Core Features

- **Multi-format input**: Supports PDF, image, and DOCX; v3.0 parses DOCX natively, which is reported to be dozens of times faster;
- **Intelligent content extraction**: Automatically removes redundant elements such as headers and footers, outputs content in reading order, and preserves structural hierarchy;
- **Formula and table recognition**: Converts formulas to LaTeX and tables to HTML; handles images and formulas inside tables, and numbering of display (interline) formulas;
- **OCR and multi-language support**: Supports 109 languages; vertical text and seal recognition have been added.
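Because formulas come out as LaTeX and tables as HTML, downstream code can pull both out of the converted Markdown with ordinary string processing. The following is a minimal sketch, assuming display formulas are wrapped in `$$ ... $$` and tables appear as `<table>` elements embedded in the Markdown. Both delimiters, the function name `extract_formulas_and_tables`, and the sample string are illustrative assumptions, not a documented MinerU output contract.

```python
import re

# Assumed delimiters: $$...$$ for display formulas, <table>...</table> for tables.
FORMULA_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def extract_formulas_and_tables(markdown: str) -> dict[str, list[str]]:
    """Return LaTeX formula bodies and raw HTML tables found in a Markdown string."""
    formulas = [body.strip() for body in FORMULA_RE.findall(markdown)]
    tables = TABLE_RE.findall(markdown)
    return {"formulas": formulas, "tables": tables}


# Hypothetical sample resembling converted output.
sample = (
    "# Results\n\n"
    "$$\nE = mc^2\n$$\n\n"
    "<table><tr><td>a</td><td>1</td></tr></table>\n"
)
found = extract_formulas_and_tables(sample)
print(len(found["formulas"]), "formula(s),", len(found["tables"]), "table(s)")
```

A regular expression is enough for a quick inventory like this; nested tables or formulas inside table cells would call for a real HTML parser.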

## Technical Architecture and Performance Upgrade

- **Dual-backend design**: Pipeline backend (can run on CPU alone, or on a GPU with 4GB VRAM; OmniDocBench score: 86.2); VLM backend (accuracy score above 90, requires 8GB+ VRAM);
- **v3.0 upgrade highlights**: Architecture optimization (the Pipeline backend's accuracy now exceeds that of the previous VLM backend), API/CLI orchestration, asynchronous tasks, multi-GPU deployment, memory optimization, thread safety;
- **License cleanup**: Removed the models licensed under AGPLv3 and CC-BY-NC-SA, so the bundled models now carry more permissive licenses.
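The VRAM figures above suggest a simple rule for choosing a backend on a given machine. The sketch below is one way to encode it. It assumes the 4GB figure applies to running the Pipeline backend on a GPU and that CPU is the fallback below that. The function `choose_backend` and the returned labels are hypothetical names for illustration, not MinerU identifiers.

```python
from typing import Optional

# Thresholds taken from the figures quoted in this article (GB of VRAM).
PIPELINE_MIN_VRAM_GB = 4
VLM_MIN_VRAM_GB = 8


def choose_backend(vram_gb: Optional[float]) -> tuple[str, str]:
    """Map available GPU memory to a (backend, device) pair.

    Pass None when no GPU is present. The labels are illustrative only.
    """
    if vram_gb is None or vram_gb < PIPELINE_MIN_VRAM_GB:
        return ("pipeline", "cpu")  # Pipeline is described as CPU-capable
    if vram_gb < VLM_MIN_VRAM_GB:
        return ("pipeline", "gpu")  # enough memory for Pipeline, not for VLM
    return ("vlm", "gpu")


for memory in (None, 6, 24):
    print(memory, "->", choose_backend(memory))
```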

## Deployment and Usage Guide

- **Installation**: pip installation (`uv pip install -U "mineru[all]"`; the quotes keep the shell from interpreting the brackets), source code installation, Docker deployment;
- **Usage methods**: CLI (GPU or CPU mode), FastAPI, and a Gradio WebUI (hosted on the official online version, ModelScope, and Hugging Face Spaces).
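For batch jobs, the CLI can be driven from a script. The sketch below wraps it with `subprocess`. It assumes the executable is named `mineru` and accepts `-p` (input path), `-o` (output directory), and an optional `-b` (backend), as in earlier releases. Treat these flags as assumptions and check `mineru --help` for your installed version. The helper names and the `paper.pdf` path are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_mineru_command(
    input_path: Path, output_dir: Path, backend: Optional[str] = None
) -> list[str]:
    """Assemble the CLI invocation; the flags are assumed, verify with --help."""
    cmd = ["mineru", "-p", str(input_path), "-o", str(output_dir)]
    if backend:
        cmd += ["-b", backend]
    return cmd


def parse_document(
    input_path: Path, output_dir: Path, backend: Optional[str] = None
) -> bool:
    """Run the CLI if it is installed; return False when it is not on PATH."""
    if shutil.which("mineru") is None:
        return False
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_mineru_command(input_path, output_dir, backend), check=True)
    return True


# Only prints the command; nothing is executed here.
print(build_mineru_command(Path("paper.pdf"), Path("out"), backend="pipeline"))
```

Building the argument list separately from running it keeps the command easy to log and test without the tool installed.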

## Application Scenarios and Value

- **RAG systems**: Convert PDF libraries into structured Markdown to improve retrieval accuracy;
- **Training data preparation**: Batch-process academic papers and reports to produce clean training text;
- **Agent workflows**: The JSON output is straightforward to consume programmatically, and the API supports parsing documents on demand.
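For the RAG case, structured Markdown is useful mainly because headings give natural chunk boundaries. The following is a minimal sketch of heading-based chunking over a Markdown string. It is not part of MinerU, and the `Chunk` class and `chunk_by_heading` function are hypothetical names. It assumes ATX-style `#` headings and does not special-case `#` lines inside fenced code.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading: str  # heading text without the leading '#' marks ("" before the first heading)
    text: str     # body text under that heading


def chunk_by_heading(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading, keeping any text before the first heading."""
    chunks: list[Chunk] = []
    heading = ""
    lines: list[str] = []

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if heading or any(line.strip() for line in lines):
            chunks.append(Chunk(heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Intro\nMinerU converts PDFs.\n\n## Tables\nTables become HTML.\n"
for chunk in chunk_by_heading(doc):
    print(repr(chunk.heading), "->", repr(chunk.text))
```

Each chunk can then be embedded and indexed with its heading kept as metadata, which tends to make retrieved passages easier to attribute.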

## Current Limitations and Future Directions

**Limitations**: Reading order may be wrong in extremely complex layouts; vertical text support is limited; code blocks are not supported; special formats such as comics and textbooks parse poorly; rows and columns may be misrecognized in complex tables.

**Future**: The team continues to improve the tool and welcomes community feedback via GitHub.

## Conclusion: Evolution from Tool to Infrastructure

MinerU is evolving from a standalone tool into infrastructure for large-scale document parsing. v3.0 reduces resource consumption while maintaining high accuracy, and it supports multi-GPU deployment and load balancing. For LLM application developers, it is an open-source tool worth trying. The project itself is released under the AGPLv3 license, and related papers such as MinerU-Diffusion have been published.