Zing Forum

KubriCount and HieraCount: Enabling AI to Precisely Count Targets of Any Granularity

The research team redefines open-world counting as multi-granularity counting and, through the KubriCount dataset and HieraCount model, addresses the prompt-following failures of vision-language models (VLMs) in fine-grained counting.

Tags: vision-language models, multi-granularity counting, object counting, KubriCount, HieraCount, fine-grained understanding
Published 2026-05-12 01:32 · Recent activity 2026-05-12 13:24 · Estimated read 5 min
Section 01

[Main Post/Introduction] KubriCount and HieraCount: Redefining Multi-Granularity Counting to Solve AI's Fine-Grained Counting Challenges

The research team redefines open-world counting as multi-granularity counting. To address the prompt-following failures of vision-language models (VLMs) in fine-grained counting, they propose the KubriCount dataset and the HieraCount model, enabling precise counting of targets at any granularity.

Section 02

Background: Core Pain Point of AI Counting — Prompt-Following Failure Caused by Granularity Ambiguity

Counting tasks that are simple for humans often trip up AI, because existing methods ignore the diversity of counting granularity (identity, attribute, instance, etc.). For example, different queries over the same scene ("count the sheep" vs. "count the white sheep") require different results, but existing systems cannot distinguish them reliably, so the counts they produce often fail to match user expectations.
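The granularity problem above can be made concrete with a toy example: the same annotated scene yields different counts depending on how finely the query specifies the target. The data and the `count` helper below are illustrative assumptions, not anything from the paper.

```python
# Hypothetical scene: each dict is one detected object with its attributes.
scene = [
    {"category": "sheep", "color": "white"},
    {"category": "sheep", "color": "white"},
    {"category": "sheep", "color": "black"},
    {"category": "dog",   "color": "brown"},
]

def count(objects, category, **attrs):
    """Count objects of a category that match all given attribute filters."""
    return sum(
        1 for o in objects
        if o["category"] == category
        and all(o.get(k) == v for k, v in attrs.items())
    )

print(count(scene, "sheep"))                 # category-level query -> 3
print(count(scene, "sheep", color="white"))  # attribute-level query -> 2
```

A system that ignores the `color="white"` constraint would answer both queries with 3, which is exactly the prompt-following failure the paper targets.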

Section 03

Method 1: Multi-Granularity Counting Paradigm and KubriCount Dataset

The research proposes a new multi-granularity counting paradigm that defines a five-level granularity hierarchy (identity, attribute, instance, category, concept) and specifies targets through two modalities: visual exemplars and fine-grained text. To address the data bottleneck, the team builds the KubriCount dataset with a fully automated pipeline (controllable 3D synthesis, consistent image editing, VLM filtering), making it the largest and most comprehensively annotated counting dataset and enabling multi-granularity training and evaluation.
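A dual-modality, multi-granularity annotation as described above might look like the record below. The field names and the `CountingSample` class are illustrative assumptions, not the actual KubriCount schema; only the five granularity levels come from the text.

```python
from dataclasses import dataclass, field

# The five granularity levels named in the paradigm.
GRANULARITIES = ("identity", "attribute", "instance", "category", "concept")

@dataclass
class CountingSample:
    """Hypothetical sketch of one multi-granularity counting annotation."""
    image_id: str
    prompt: str                   # fine-grained text query
    granularity: str              # one of the five levels
    exemplar_boxes: list = field(default_factory=list)  # visual exemplars (x, y, w, h)
    count: int = 0                # ground-truth count

    def __post_init__(self):
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"unknown granularity: {self.granularity}")

sample = CountingSample(
    image_id="scene_0001",
    prompt="count the white sheep",
    granularity="attribute",
    exemplar_boxes=[(12, 40, 32, 28)],
    count=2,
)
print(sample.granularity)  # attribute
```

Keeping both a text prompt and exemplar boxes in each record is what lets one dataset support training and evaluation at every granularity level.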

Section 04

Method 2: Core Design of the HieraCount Model

The HieraCount model jointly uses text and visual exemplars as target specifications: the text channel parses fine-grained prompts to capture semantic intent, the visual channel extracts appearance features from exemplars to serve as a matching reference, and a fusion mechanism combines them into a unified target representation. This design lets the model grasp fine-grained distinctions, handle complex scenes, and generalize to real-world imagery.
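The dual-channel idea can be sketched in miniature: fuse a text embedding and a visual-exemplar embedding into one target vector, then count candidate objects that match it. The embedding dimension, the averaging fusion, and the cosine-similarity matching below are all toy assumptions for illustration, not HieraCount's actual architecture.

```python
import numpy as np

def fuse(text_emb, visual_emb):
    """Fuse the two channels into a unified, unit-norm target representation."""
    fused = (text_emb + visual_emb) / 2.0  # simple averaging stand-in
    return fused / np.linalg.norm(fused)

def count_matches(target, candidates, threshold=0.5):
    """Count candidates whose cosine similarity to the target exceeds threshold."""
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(((candidates @ target) > threshold).sum())

# Deterministic toy embeddings (4-dim, made up).
text_emb   = np.array([1.0, 0.0, 0.0, 0.0])  # text channel: "white sheep"
visual_emb = np.array([0.9, 0.1, 0.0, 0.0])  # visual channel: exemplar crop
target = fuse(text_emb, visual_emb)

candidates = np.array([
    [0.95, 0.05, 0.0, 0.0],  # close to the target -> counted
    [1.0,  0.0,  0.1, 0.0],  # close to the target -> counted
    [0.0,  0.0,  1.0, 0.0],  # unrelated object   -> not counted
])
print(count_matches(target, candidates))  # 2
```

The key design point the sketch preserves is that neither channel alone defines the target: the text disambiguates intent while the exemplar anchors appearance, and matching runs against their fusion.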

Section 05

Experimental Evidence: Significant Performance Improvement of HieraCount

Benchmark tests show that existing models (both multimodal large models and specialized counting models) suffer severe prompt-following failures when queries make fine-grained distinctions. HieraCount stands out: it achieves a significant gain in multi-granularity counting accuracy, strong generalization, and faithful prompt following. Key findings: existing models handle negative prompts poorly; introducing visual exemplars improves accuracy; and multi-granularity training boosts performance across all granularity levels.
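For context on how counting accuracy is typically scored: object-counting benchmarks commonly report mean absolute error (MAE) and root mean squared error (RMSE) between predicted and ground-truth counts. The numbers below are made up for illustration, not results from the paper.

```python
import math

def mae(preds, gts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(preds, gts)) / len(gts)

def rmse(preds, gts):
    """Root mean squared error; penalizes large miscounts more heavily."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(gts))

gts   = [3, 7, 12, 5]  # ground-truth counts (made up)
preds = [3, 6, 14, 5]  # model predictions (made up)

print(mae(preds, gts))   # 0.75
print(rmse(preds, gts))  # ~1.118
```

Lower is better for both; a prompt-following failure such as counting all sheep when asked for white sheep shows up directly as a larger error on attribute-level queries.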

Section 06

Conclusions and Applications: From Theoretical Breakthrough to Practical Scene Implementation

Theoretical contributions: Redefining open-world counting as a multi-granularity problem, proposing a fully automated data expansion process, and demonstrating model design principles for the joint use of multimodal information. Practical applications: Smart photo albums (fine-grained photo counting), industrial quality inspection (counting specific defects), medical imaging (cell/lesion counting), autonomous driving (scene object understanding), etc.

Section 07

Limitations and Future Directions: Room for Continuous Optimization

Current limitations: KubriCount is based on synthetic data, which has a gap with the real world; the five-level granularity system may not cover all scenarios; the computational cost is relatively high. Future directions: Expand real-world data, dynamic granularity learning, cross-modal expansion (video/3D), efficiency optimization.