# KubriCount and HieraCount: Enabling AI to Precisely Count Targets of Any Granularity

> The research team redefines open-world counting as multi-granularity counting, and solves the prompt-following failure problem of vision-language models (VLMs) in fine-grained counting through the KubriCount dataset and HieraCount model.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-11T17:32:37.000Z
- Last activity: 2026-05-12T05:24:28.720Z
- Popularity: 135.1
- Keywords: vision-language models, multi-granularity counting, object counting, KubriCount, HieraCount, fine-grained understanding
- Page link: https://www.zingnex.cn/en/forum/thread/kubricounthieracount-ai
- Canonical: https://www.zingnex.cn/forum/thread/kubricounthieracount-ai

---

## Introduction: Redefining Open-World Counting as Multi-Granularity Counting

The research team reframes open-world counting as multi-granularity counting. Vision-language models (VLMs) often fail to follow the prompt when a counting query is fine-grained, so the team proposes the KubriCount dataset and the HieraCount model, which are designed to count targets at whatever granularity the query specifies.

## Background: The Core Pain Point of AI Counting Is Prompt-Following Failure Caused by Granularity Ambiguity

Counting is simple for humans but error-prone for AI, because existing methods ignore the fact that a counting query can target different granularities (identity, attribute, instance, and so on). In the same scene, "count sheep" and "count white sheep" should return different numbers, yet existing systems often cannot tell the two queries apart and return a count that does not match what the user asked for.
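The sheep example can be made concrete with a toy scene. The sketch below is purely illustrative: the `SceneObject` type, the `color` attribute, and the scene contents are assumptions made up for this example, not part of KubriCount.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    category: str  # e.g. "sheep"
    color: str     # hypothetical attribute used for the fine-grained query


# Toy scene (made up): three white sheep, two black sheep, one white dog.
scene = [
    SceneObject("sheep", "white"),
    SceneObject("sheep", "white"),
    SceneObject("sheep", "white"),
    SceneObject("sheep", "black"),
    SceneObject("sheep", "black"),
    SceneObject("dog", "white"),
]


def count(objects, category, color=None):
    """Count objects of a category, optionally restricted to one color."""
    return sum(
        1
        for o in objects
        if o.category == category and (color is None or o.color == color)
    )


sheep_total = count(scene, "sheep")           # "count sheep"
white_sheep = count(scene, "sheep", "white")  # "count white sheep"
print(sheep_total, white_sheep)  # 5 3
```

A system that answers 5 to both queries has ignored the attribute in the prompt, which is the prompt-following failure described above.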

## Method 1: Multi-Granularity Counting Paradigm and KubriCount Dataset

The research proposes a multi-granularity counting paradigm with a five-level granularity system (identity, attribute, instance, category, concept). A target is specified through two modalities at once: visual samples and a fine-grained text description. To address the data bottleneck, the team builds the KubriCount dataset with a fully automated pipeline that combines controllable 3D synthesis, consistent image editing, and VLM-based filtering. According to the team, it is the largest and most comprehensively annotated counting dataset to date, and it supports both training and evaluation across granularities.
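One way to picture the dual-modality target specification is as a small record that pairs a text prompt with visual samples and a granularity level. This is a minimal sketch under stated assumptions: the names `Granularity`, `CountingQuery`, `exemplar_boxes`, and the box format are hypothetical and do not come from the KubriCount annotation schema, which the source does not describe.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    # The five levels, in the order the text lists them.
    IDENTITY = "identity"
    ATTRIBUTE = "attribute"
    INSTANCE = "instance"
    CATEGORY = "category"
    CONCEPT = "concept"


# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = tuple[float, float, float, float]


@dataclass(frozen=True)
class CountingQuery:
    text: str                             # fine-grained text description
    granularity: Granularity              # level the query targets
    exemplar_boxes: tuple[Box, ...] = ()  # visual samples marked in the image

    def is_dual_modal(self) -> bool:
        """True when both a text prompt and at least one visual sample are given."""
        return bool(self.text.strip()) and len(self.exemplar_boxes) > 0


query = CountingQuery(
    text="white sheep",
    granularity=Granularity.ATTRIBUTE,
    exemplar_boxes=((10.0, 20.0, 50.0, 60.0),),
)
print(query.is_dual_modal())  # True
```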

## Method 2: Core Design of the HieraCount Model

The HieraCount model uses text and visual samples jointly as the target specification. The text channel parses the fine-grained prompt to capture semantic intent, the visual channel extracts appearance features that serve as a matching reference, and a fusion mechanism combines the two into a unified target representation. This design is intended to let the model resolve fine-grained distinctions, handle complex scenes, and generalize to real-world images.
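The source does not specify how the fusion mechanism works, so the following is only a toy illustration of the general idea of merging a text embedding and a visual-sample embedding into one target vector and matching candidate regions against it. The weighted-average fusion, the cosine threshold, and all vectors are assumptions chosen for the example; this is not HieraCount's architecture.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def fuse(text_emb, visual_emb, alpha=0.5):
    """Assumed fusion: convex combination of unit vectors, re-normalized."""
    t, s = normalize(text_emb), normalize(visual_emb)
    return normalize([alpha * a + (1 - alpha) * b for a, b in zip(t, s)])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def count_matches(target, region_embs, threshold=0.8):
    """Count candidate regions whose similarity to the target passes the threshold."""
    return sum(1 for r in region_embs if cosine(target, r) >= threshold)


# Made-up 3-d embeddings standing in for real encoder outputs.
text_emb = [1.0, 0.0, 0.0]    # text prompt
visual_emb = [0.8, 0.6, 0.0]  # visual sample
regions = [
    [0.9, 0.3, 0.0],  # close to the fused target
    [1.0, 0.4, 0.1],  # close to the fused target
    [0.0, 0.0, 1.0],  # unrelated
    [0.1, 1.0, 0.0],  # shares one feature only
]

target = fuse(text_emb, visual_emb)
matched = count_matches(target, regions)
print(matched)  # 2
```

The point of the sketch is that neither modality alone fixes the target: shifting `alpha` toward 0 or 1 changes which regions pass the threshold.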

## Experimental Evidence: Significant Performance Improvement of HieraCount

Benchmark tests show that existing models, both multimodal large models and specialized counting models, suffer severe prompt-following failures when the query requires a fine-grained distinction. HieraCount is reported to improve multi-granularity counting accuracy substantially, to generalize well, and to follow prompts accurately. Key findings: existing models handle negative prompts poorly; adding visual samples improves accuracy; and multi-granularity training improves performance at every granularity.
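The source does not say which metric the benchmark reports. As an assumption, the sketch below uses mean absolute error (MAE), a common counting metric, broken down by granularity so that a model that is accurate on coarse queries but fails on fine-grained ones is easy to spot. The records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def mae_by_granularity(records):
    """Compute MAE per granularity from (granularity, predicted, actual) records."""
    errors = defaultdict(list)
    for granularity, predicted, actual in records:
        errors[granularity].append(abs(predicted - actual))
    return {g: sum(e) / len(e) for g, e in errors.items()}


# Illustrative records only (not measured results).
records = [
    ("category", 5, 5),
    ("category", 7, 8),
    ("attribute", 5, 3),  # e.g. all sheep counted when only white sheep were asked for
    ("attribute", 4, 4),
]

scores = mae_by_granularity(records)
print(scores)  # {'category': 0.5, 'attribute': 1.0}
```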

## Conclusions and Applications: From Theoretical Breakthrough to Practical Scene Implementation

Theoretical contributions: the work redefines open-world counting as a multi-granularity problem, proposes a fully automated pipeline for scaling up data, and demonstrates design principles for using text and visual information jointly. Practical applications include smart photo albums (fine-grained counting in photos), industrial quality inspection (counting specific defects), medical imaging (cell and lesion counting), and autonomous driving (scene object understanding).

## Limitations and Future Directions: Room for Continuous Optimization

Current limitations: KubriCount is built from synthetic data, which leaves a gap to real-world images; the five-level granularity system may not cover every scenario; and the computational cost is relatively high. Future directions: expanding real-world data, learning granularity dynamically, extending to other modalities (video and 3D), and improving efficiency.
