Reading

Multimodal Large Model OCR Optimization Practice: Synergistic Application of LoRA, GRPO, and ICL

An OCR optimization solution for the Qwen3-VL-4B-based multimodal large model, combining LoRA fine-tuning, GRPO reinforcement learning, and in-context learning (ICL) technologies, achieves performance improvements in downstream OCR tasks across multiple public datasets.

多模态大模型OCRLoRAGRPO上下文学习Qwen3-VL强化学习参数高效微调

Published 2026-06-12 16:14Recent activity 2026-06-12 16:19Estimated read 9 min

Section 01

[Main Floor] Multimodal Large Model OCR Optimization Practice: Synergistic Application of LoRA, GRPO, and ICL

Core Viewpoint: An OCR optimization solution for the Qwen3-VL-4B-based multimodal large model, combining LoRA fine-tuning, GRPO reinforcement learning, and in-context learning (ICL) technologies, achieves performance improvements in downstream OCR tasks across multiple public datasets. The project supports multiple base models, provides a complete training-to-inference workflow, and can serve as a graduation project framework or research foundation.

Original Author and Source

Original Author/Maintainer: akjncjancj
Source Platform: GitHub
Original Title: bishe-sft
Original Link: https://github.com/akjncjancj/bishe-sft
Release Time: June 12, 2026

Section 02

Project Background and Challenges

With the rapid development of multimodal large language models (MLLM), the performance of general models in specific downstream tasks often fails to meet practical needs. As an interdisciplinary field of computer vision (CV) and natural language processing (NLP), optical character recognition (OCR) places extremely high demands on the model's multimodal understanding ability.

General multimodal large models have the following problems in specific OCR scenarios:

Insufficient domain adaptation: There is a distribution gap between general training data and real OCR scenarios
Limited fine-grained recognition capability: Low accuracy in recognizing small fonts, complex layouts, handwritten text, etc.
Trade-off between inference efficiency and precision: High inference cost of large models, requiring optimization of efficiency while maintaining precision

This project proposes a complete OCR optimization solution to address these challenges.

Section 03

Detailed Explanation of Core Technical Solutions

Based on the Qwen3-VL-4B model, three core technologies are used for synergistic optimization:

1. LoRA Fine-tuning Technology

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that achieves targeted enhancement by injecting low-rank matrices into Transformer layers. Advantages: low VRAM usage, fast training speed, reusable model. Fine-tuning is performed using the LLaMA-Factory framework.

2. GRPO Reinforcement Learning

GRPO (Group Relative Policy Optimization) is a reinforcement learning method for large models, which reduces reliance on the value network through intra-group relative advantage estimation. In OCR, it helps: learn stable output formats, optimize long text generation, and reduce hallucination phenomena.

3. In-context Learning (ICL)

Introduce example samples during the inference phase to guide output, achieving zero/few-shot performance improvement. Advantages: no additional training required, flexible adaptation to scenarios, complementary to fine-tuning to form a closed loop.

Section 04

Datasets and Evaluation Benchmarks

Four public OCR datasets are used for evaluation:

Dataset	Characteristics	Application Scenarios
CTW1500	Curved text detection	Curved text in natural scenes
ICDAR2013	Horizontal text recognition	Document scanning, printed text recognition
ICDAR2015	Multi-directional text	Street scenes, billboards, etc.
CASIA-HWDB2	Handwriting database	Chinese handwriting recognition

Covers multiple dimensions such as printed/handwritten text, horizontal/tilted text, Chinese/English, comprehensively evaluating OCR capabilities.

Section 05

Model Support and Extensibility

In addition to Qwen3-VL-4B, the following base models are supported:

Gemma-3-4B: Google's open-source multimodal model, lightweight and efficient
MiniCPM-V-2_6: FaceWall AI's edge-side multimodal model

The multi-model support design makes the project highly extensible, allowing selection of appropriate base models based on hardware and task requirements.

Section 06

Project Structure and Usage Value

The project adopts a modular design, including:

Data download scripts: Automatically download four OCR datasets from Hugging Face
Model acquisition tool: Supports downloading domestically accessible weights from the ModelScope mirror site
LoRA training configuration: Complete training configuration based on LLaMA-Factory
Evaluation scripts: Supports standardized evaluation across multiple datasets

It can be used as a complete framework for undergraduate graduation projects, or as a base code library for secondary development in OCR research.

Section 07

Technical Highlights and Insights

Core Insight: Multi-technology synergy is superior to single optimization.

LoRA solves training efficiency and resource usage issues, GRPO improves output stability and accuracy, and ICL optimizes inference effects without increasing training costs. The three form a complete optimization chain from training to inference.

It provides a reproducible and scalable technical solution for developers in the multimodal large model field, covering the complete workflow from environment setup, data preparation, model training to effect evaluation.

Section 08

Project Summary

OCR optimization for multimodal large models is a systems engineering that requires integrating model architecture, training strategies, and inference techniques. This project demonstrates how to improve the performance of specific downstream tasks while maintaining general capabilities through the synergistic application of LoRA, GRPO, and ICL.

For academic researchers: Understand practical cases of large model fine-tuning and reinforcement learning; For industrial developers: A directly deployable OCR optimization solution.