Technical Architecture and Implementation Mechanism
The MStar framework consists of three key components:
Regional Parsing Module
Converts users' natural language descriptions into structured regional queries, combining visual features and language understanding to identify implicit spatial relationships and constraints (e.g., parsing "the area to the right of the red object in the upper left corner").
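To make the idea of a "structured regional query" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not MStar's actual data model: the names `Box`, `RegionQuery`, `DetectedObject`, `quadrant`, and `resolve` are hypothetical, the query is written by hand rather than produced by a language model, and object detection is assumed to have already supplied colored bounding boxes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in image coordinates (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class RegionQuery:
    """Hypothetical structured form of a phrase such as
    'the area to the right of the red object in the upper left corner'."""
    relation: str          # spatial relation, e.g. "right_of"
    anchor_color: str      # attribute of the anchor object, e.g. "red"
    anchor_quadrant: str   # coarse location constraint, e.g. "upper_left"


@dataclass(frozen=True)
class DetectedObject:
    color: str
    box: Box


def quadrant(box: Box, width: float, height: float) -> str:
    """Name the image quadrant that contains the box center."""
    cx = (box.x0 + box.x1) / 2
    cy = (box.y0 + box.y1) / 2
    vertical = "upper" if cy < height / 2 else "lower"
    horizontal = "left" if cx < width / 2 else "right"
    return f"{vertical}_{horizontal}"


def resolve(query: RegionQuery, objects: List[DetectedObject],
            width: float, height: float) -> Optional[Box]:
    """Return the region the query refers to, or None if no anchor matches."""
    for obj in objects:
        matches = (obj.color == query.anchor_color
                   and quadrant(obj.box, width, height) == query.anchor_quadrant)
        if matches and query.relation == "right_of":
            # Strip from the anchor's right edge to the image border,
            # at the same vertical extent as the anchor.
            return Box(obj.box.x1, obj.box.y0, width, obj.box.y1)
    return None


query = RegionQuery("right_of", "red", "upper_left")
objects = [
    DetectedObject("blue", Box(10, 10, 30, 30)),
    DetectedObject("red", Box(20, 15, 40, 35)),   # the intended anchor
    DetectedObject("red", Box(70, 70, 90, 90)),   # red, but lower right
]
region = resolve(query, objects, width=100, height=100)
```

Separating the query structure from its resolution means the spatial constraint can be inspected on its own before any pixels are examined.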
External Reasoning Engine
The core innovation of the framework. It maintains an explicit reasoning state (the regions identified so far, the hypotheses awaiting verification, and the reasoning chain) and supports rule-based logical reasoning, similarity-matching reasoning, and contextual semantic reasoning. Every reasoning step can be traced and verified.
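An explicit reasoning state of this kind could be sketched as follows. This is an assumed structure for illustration only: the class names `Step` and `ReasoningState`, the methods `propose` and `confirm`, and the three `kind` labels are hypothetical and simply mirror the categories named above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Step:
    """One recorded reasoning step."""
    kind: str      # "rule", "similarity", or "context" (assumed labels)
    claim: str     # what was concluded
    evidence: str  # why it was concluded


@dataclass
class ReasoningState:
    """Explicit state kept outside the model, so each step can be inspected."""
    regions: Dict[str, Tuple[int, int, int, int]] = field(default_factory=dict)
    pending: List[str] = field(default_factory=list)  # hypotheses to verify
    chain: List[Step] = field(default_factory=list)   # confirmed steps, in order

    def propose(self, hypothesis: str) -> None:
        """Register a hypothesis that still needs verification."""
        self.pending.append(hypothesis)

    def confirm(self, hypothesis: str, kind: str, evidence: str) -> None:
        """Move a hypothesis from 'pending' into the reasoning chain."""
        self.pending.remove(hypothesis)
        self.chain.append(Step(kind, hypothesis, evidence))


state = ReasoningState()
state.regions["anchor"] = (20, 15, 40, 35)
state.propose("anchor is the red object")
state.propose("target lies right of anchor")
state.confirm("anchor is the red object", "similarity",
              "color histogram closest to red")
```

After this run, `state.chain` holds one confirmed step and `state.pending` still holds the unverified hypothesis, which is what makes the process auditable step by step.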
Iterative Verification Mechanism
Cross-validates key reasoning steps before generating the final answer, detecting contradictions or inconsistencies and thereby reducing the probability of hallucination.
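One simple way to picture contradiction detection is shown below. It is a sketch under strong assumptions, not the framework's actual verification procedure: claims are reduced to hypothetical (subject, attribute, value) triples, and a contradiction is defined narrowly as two claims that assign different values to the same subject and attribute. The function name `find_contradictions` is likewise an assumption.

```python
from typing import Dict, List, Tuple

Claim = Tuple[str, str, str]               # (subject, attribute, value)
Conflict = Tuple[Tuple[str, str], str, str]  # (key, first value, later value)


def find_contradictions(claims: List[Claim]) -> List[Conflict]:
    """Report every claim that disagrees with the first value seen
    for the same (subject, attribute) pair."""
    seen: Dict[Tuple[str, str], str] = {}
    conflicts: List[Conflict] = []
    for subject, attribute, value in claims:
        key = (subject, attribute)
        if key in seen and seen[key] != value:
            conflicts.append((key, seen[key], value))
        else:
            seen.setdefault(key, value)
    return conflicts


claims = [
    ("region_A", "color", "red"),
    ("region_A", "position", "upper_left"),
    ("region_A", "color", "blue"),  # disagrees with the first claim
]
conflicts = find_contradictions(claims)

# Only produce a final answer when the reasoning chain is consistent.
ready_to_answer = not conflicts
```

In this example the check finds one conflict, so the answer would be withheld until the inconsistent step is re-examined.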