Zing Forum

Reading

Multimodal Large Model OCR Fine-Tuning Practice: Analysis of the Combined Optimization Scheme of LoRA+GRPO+ICL

This project is an undergraduate graduation design that demonstrates how to use LoRA and GRPO technologies to fine-tune a multimodal large language model, and integrate ICL (In-Context Learning) during the inference phase to improve OCR task performance. Based on the Qwen3VL model and combined with CTW and CASIA datasets, the project provides a complete optimization scheme for multimodal OCR models.

LoRAGRPOICL多模态大模型OCRQwen3VL强化学习参数高效微调文本识别上下文学习
Published 2026-06-12 15:14Recent activity 2026-06-12 15:28Estimated read 5 min
Multimodal Large Model OCR Fine-Tuning Practice: Analysis of the Combined Optimization Scheme of LoRA+GRPO+ICL
1

Section 01

Multimodal Large Model OCR Fine-Tuning Practice: Guide to the Combined Optimization Scheme of LoRA+GRPO+ICL

This project is an undergraduate graduation design that demonstrates how to use LoRA (Low-Rank Adaptation) and GRPO (Group Relative Policy Optimization) technologies to fine-tune the multimodal large language model Qwen3VL, and integrate ICL (In-Context Learning) during the inference phase to improve OCR task performance. Combined with CTW and CASIA datasets, the project provides a complete optimization scheme for multimodal OCR models, and the technical combination forms an optimization loop from training to inference.

2

Section 02

Technical Background: Synergistic Effect of Three Core Technologies

The project's technical scheme is based on three core components: LoRA, GRPO, and ICL. LoRA reduces parameter consumption through low-rank matrix fine-tuning while preserving pre-trained knowledge; GRPO (an improved version of PPO) uses intra-group relative reward estimation as the baseline to reduce memory usage and optimize OCR recognition strategies; ICL adapts to specific scenarios through examples during inference. The three form a complete chain: LoRA efficient fine-tuning → GRPO reinforcement optimization → ICL inference enhancement.

3

Section 03

Model Architecture and Training Strategy

The base model is Qwen3VL (visual encoder + language model architecture). QLoRA quantized fine-tuning (16-bit floating point) is used, with LoRA configuration targeting attention layers (q/k/v/o_proj) and feed-forward network layers (gate/up/down_proj). The dataset uses 3000 samples each from CTW and CASIA, merged into a 6000-sample training set. Training configuration: mixed precision (fp16), batch size 2, gradient accumulation 4 (equivalent to batch size 8), learning rate 5e-5 (cosine annealing).

4

Section 04

Reward Function Design: Multi-Dimensional Quality Evaluation

GRPO training uses dual reward functions: accuracy reward (1.0 if output is exactly consistent with annotation, otherwise 0); edit distance reward (Levenshtein similarity with a weight of 0.5). The combination of the two: accuracy pursues final correctness, while edit distance provides a progressive optimization signal to assist model learning.

5

Section 05

ICL Inference Optimization: Value of Contextual Examples

ICL technology is integrated during the inference phase, and input examples (image-text pairs) help the model adapt to specific scenarios (such as printed/handwritten text, street view/documents). ICL and fine-tuning form a closed loop: fine-tuning masters basic capabilities, ICL quickly adapts to scenarios, improving flexibility.

6

Section 06

Technical Highlights and Innovations

  1. Systematic technical combination: LoRA, GRPO, and ICL form a complete optimization chain from training to inference; 2. Reward function design: combination of accuracy and edit distance, balancing final correctness and progressive optimization; 3. Meticulous data processing: image format conversion, dialogue template construction, system prompt design (specifying the OCR expert role to output only results).
7

Section 07

Application Scenarios and Limitations

Applicable scenarios: limited resources cannot support full fine-tuning, rapid adaptation to new scenarios/fonts, building dedicated OCR capabilities from general models. Limitations: GRPO requires designing appropriate reward functions; LoRA still needs CUDA memory; ICL effect depends on example selection; complete inference process and evaluation metrics need to be improved.