Section 01
Project Introduction: Production-Grade Visual Localization API Based on LLaVA
This article introduces visual-grounding-api, a production-grade visual localization service built on LLaVA-1.5-7B with LoRA fine-tuning. Its core innovation is to predict bounding-box coordinates with an MLP regression head instead of parsing them out of generated text, which avoids the inconsistent output formats and hallucinations that affect traditional text-based solutions. On the RefCOCO test set, the project reports a 297.5% improvement in IoU over the baseline. It also provides a complete FastAPI service, an interactive demo interface, and a Docker-based deployment setup, combining academic research with engineering practice.
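To make the contrast between text parsing and regression concrete, here is a minimal sketch in plain Python. It is not the project's code, and every name in it is hypothetical: `parse_box_from_text` stands in for the kind of parser a text-coordinate pipeline needs, `head_output_to_box` assumes a regression head that emits four raw values which are squashed into [0, 1] with a sigmoid, and `iou` is the standard intersection-over-union metric. The `(x1, y1, x2, y2)` box layout and the normalization to [0, 1] are assumptions as well.

```python
import math
import re
from typing import Optional, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]

_NUM = r"([0-9]*\.?[0-9]+)"
_BOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_box_from_text(text: str) -> Optional[Box]:
    """Hypothetical text-based path: extract "[x1, y1, x2, y2]" from a reply.

    Returns None when the reply does not contain a box in the expected
    format, which is the failure mode a regression head avoids.
    """
    match = _BOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in match.groups())
    return (x1, y1, x2, y2)


def _sigmoid(v: float) -> float:
    # Numerically stable form that avoids overflow for large |v|.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def head_output_to_box(raw: Sequence[float]) -> Box:
    """Hypothetical regression path: map four raw head outputs to a box.

    The result is always four floats in [0, 1] with x1 <= x2 and y1 <= y2,
    so there is no output format to get wrong.
    """
    if len(raw) != 4:
        raise ValueError("expected exactly four values")
    a, b, c, d = (_sigmoid(v) for v in raw)
    return (min(a, c), min(b, d), max(a, c), max(b, d))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

In this sketch, a reply such as "The cat is on the left" makes `parse_box_from_text` return `None`, whereas `head_output_to_box` yields a well-formed box for any four finite inputs. Whether that box is accurate is a separate question, which is what an IoU evaluation against ground-truth boxes measures.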