Section 01
Introduction: Core Innovations and Value of the Text-Aware VQA System
This article introduces the Text-Aware VQA project: a text-aware visual question answering system that integrates OCR with the BLIP model and achieves efficient, accurate image-text understanding through question-guided filtering and multimodal fusion. Its core innovations are the deep integration of OCR with the visual model, a question-guided attention mechanism, and a lightweight design suited to edge deployment. Compared with the BLIP baseline, the system improves accuracy by 9.4% and inference speed by 15% while reducing model size by 36%, and it has broad applications in document intelligence, scene interaction, and educational assistance.
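To make the "question-guided filtering and multimodal fusion" idea concrete, here is a minimal sketch of one plausible filtering step: OCR tokens are scored by their overlap with the question, and only the relevant ones are fused into the prompt passed to the vision-language model. This is an illustrative assumption, not the project's actual implementation; all names (`filter_ocr_by_question`, `build_fused_prompt`, the `{text, box}` token format) are hypothetical.

```python
# Hypothetical sketch of question-guided OCR filtering; names and the
# OCR token format are illustrative, not from the project's codebase.

def filter_ocr_by_question(ocr_tokens, question, top_k=5):
    """Score each OCR token by simple word overlap with the question
    and keep the top_k most relevant ones (dropping zero-score tokens)."""
    q_words = set(question.lower().split())
    scored = []
    for tok in ocr_tokens:
        tok_words = set(tok["text"].lower().split())
        scored.append((len(tok_words & q_words), tok))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for score, tok in scored[:top_k] if score > 0]

def build_fused_prompt(question, kept_tokens):
    """Concatenate the question with the filtered OCR text so a
    vision-language model (e.g. BLIP) can attend to both modalities."""
    ocr_text = " ".join(tok["text"] for tok in kept_tokens)
    return f"Question: {question} OCR context: {ocr_text}"

# Toy OCR output: each token carries its text and bounding box.
tokens = [
    {"text": "Total price 42.00", "box": (10, 10, 120, 30)},
    {"text": "Thank you", "box": (10, 40, 80, 60)},
]
question = "What is the total price?"
prompt = build_fused_prompt(question, filter_ocr_by_question(tokens, question))
# The irrelevant "Thank you" token is filtered out before fusion.
```

In a real system the overlap score would be replaced by the question-guided attention mechanism described above, but the shape of the pipeline, filter first, then fuse, is the same.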