Zing Forum

KubriCount and HieraCount: Enabling AI to Precisely Count Targets of Any Granularity

The research team redefines open-world counting as multi-granularity counting and, through the KubriCount dataset and HieraCount model, addresses the prompt-following failures of vision-language models (VLMs) in fine-grained counting.

Tags: vision-language models, multi-granularity counting, object counting, KubriCount, HieraCount, fine-grained understanding
Published 2026-05-12 01:32 · Recent activity 2026-05-12 13:24 · Estimated read 5 min
Section 01

[Main Post/Introduction] KubriCount and HieraCount: Redefining Multi-Granularity Counting to Solve AI's Fine-Grained Counting Challenges

The research team redefines open-world counting as multi-granularity counting. To address the prompt-following failures of vision-language models (VLMs) in fine-grained counting, they propose the KubriCount dataset and the HieraCount model, enabling precise counting of targets at any granularity.

Section 02

Background: Core Pain Point of AI Counting — Prompt-Following Failure Caused by Granularity Ambiguity

Counting tasks that are simple for humans often trip up AI, because existing methods ignore the diversity of counting granularity (identity, attribute, instance, etc.). For example, different queries over the same scene ("count the sheep" vs. "count the white sheep") require different results, but existing systems cannot distinguish them reliably, so the counts they produce often fail to match user expectations.
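The granularity problem above can be made concrete with a toy example: the same annotated scene yields different counts depending on how finely the query specifies the target. The data and the `count` helper below are illustrative assumptions, not anything from the paper.

```python
# Hypothetical scene: each dict is one detected object with its attributes.
scene = [
    {"category": "sheep", "color": "white"},
    {"category": "sheep", "color": "white"},
    {"category": "sheep", "color": "black"},
    {"category": "dog",   "color": "brown"},
]

def count(objects, category, **attrs):
    """Count objects of a category that match all given attribute filters."""
    return sum(
        1 for o in objects
        if o["category"] == category
        and all(o.get(k) == v for k, v in attrs.items())
    )

print(count(scene, "sheep"))                 # category-level query -> 3
print(count(scene, "sheep", color="white"))  # attribute-level query -> 2
```

A system that ignores the `color="white"` constraint would answer both queries with 3, which is exactly the prompt-following failure the paper targets.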

Section 03

Method 1: Multi-Granularity Counting Paradigm and KubriCount Dataset

The research proposes a new multi-granularity counting paradigm that defines a five-level granularity hierarchy (identity, attribute, instance, category, concept) and specifies targets through two modalities: visual exemplars and fine-grained text. To address the data bottleneck, the team builds the KubriCount dataset with a fully automated pipeline (controllable 3D synthesis, consistent image editing, VLM filtering), making it the largest and most comprehensively annotated counting dataset and enabling multi-granularity training and evaluation.
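A dual-modality, multi-granularity annotation as described above might look like the record below. The field names and the `CountingSample` class are illustrative assumptions, not the actual KubriCount schema; only the five granularity levels come from the text.

```python
from dataclasses import dataclass, field

# The five granularity levels named in the paradigm.
GRANULARITIES = ("identity", "attribute", "instance", "category", "concept")

@dataclass
class CountingSample:
    """Hypothetical sketch of one multi-granularity counting annotation."""
    image_id: str
    prompt: str                   # fine-grained text query
    granularity: str              # one of the five levels
    exemplar_boxes: list = field(default_factory=list)  # visual exemplars (x, y, w, h)
    count: int = 0                # ground-truth count

    def __post_init__(self):
        if self.granularity not in GRANULARITIES:
            raise ValueError(f"unknown granularity: {self.granularity}")

sample = CountingSample(
    image_id="scene_0001",
    prompt="count the white sheep",
    granularity="attribute",
    exemplar_boxes=[(12, 40, 32, 28)],
    count=2,
)
print(sample.granularity)  # attribute
```

Keeping both a text prompt and exemplar boxes in each record is what lets one dataset support training and evaluation at every granularity level.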

Section 04

Method 2: Core Design of the HieraCount Model

The HieraCount model jointly uses text and visual exemplars as target specifications: the text channel parses fine-grained prompts to capture semantic intent, the visual channel extracts appearance features from exemplars to serve as a matching reference, and a fusion mechanism combines them into a unified target representation. This design lets the model grasp fine-grained distinctions, handle complex scenes, and generalize to real-world imagery.
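The dual-channel idea can be sketched in miniature: fuse a text embedding and a visual-exemplar embedding into one target vector, then count candidate objects that match it. The embedding dimension, the averaging fusion, and the cosine-similarity matching below are all toy assumptions for illustration, not HieraCount's actual architecture.

```python
import numpy as np

def fuse(text_emb, visual_emb):
    """Fuse the two channels into a unified, unit-norm target representation."""
    fused = (text_emb + visual_emb) / 2.0  # simple averaging stand-in
    return fused / np.linalg.norm(fused)

def count_matches(target, candidates, threshold=0.5):
    """Count candidates whose cosine similarity to the target exceeds threshold."""
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(((candidates @ target) > threshold).sum())

# Deterministic toy embeddings (4-dim, made up).
text_emb   = np.array([1.0, 0.0, 0.0, 0.0])  # text channel: "white sheep"
visual_emb = np.array([0.9, 0.1, 0.0, 0.0])  # visual channel: exemplar crop
target = fuse(text_emb, visual_emb)

candidates = np.array([
    [0.95, 0.05, 0.0, 0.0],  # close to the target -> counted
    [1.0,  0.0,  0.1, 0.0],  # close to the target -> counted
    [0.0,  0.0,  1.0, 0.0],  # unrelated object   -> not counted
])
print(count_matches(target, candidates))  # 2
```

The key design point the sketch preserves is that neither channel alone defines the target: the text disambiguates intent while the exemplar anchors appearance, and matching runs against their fusion.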

Section 05

Experimental Evidence: Significant Performance Improvement of HieraCount

Benchmark tests show that existing models (both multimodal large models and specialized counting models) suffer severe prompt-following failures when queries make fine-grained distinctions. HieraCount stands out: it achieves a significant gain in multi-granularity counting accuracy, strong generalization, and faithful prompt following. Key findings: existing models handle negative prompts poorly; introducing visual exemplars improves accuracy; and multi-granularity training boosts performance across all granularity levels.
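For context on how counting accuracy is typically scored: object-counting benchmarks commonly report mean absolute error (MAE) and root mean squared error (RMSE) between predicted and ground-truth counts. The numbers below are made up for illustration, not results from the paper.

```python
import math

def mae(preds, gts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(preds, gts)) / len(gts)

def rmse(preds, gts):
    """Root mean squared error; penalizes large miscounts more heavily."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(gts))

gts   = [3, 7, 12, 5]  # ground-truth counts (made up)
preds = [3, 6, 14, 5]  # model predictions (made up)

print(mae(preds, gts))   # 0.75
print(rmse(preds, gts))  # ~1.118
```

Lower is better for both; a prompt-following failure such as counting all sheep when asked for white sheep shows up directly as a larger error on attribute-level queries.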

Section 06

Conclusions and Applications: From Theoretical Breakthrough to Practical Scene Implementation

Theoretical contributions: Redefining open-world counting as a multi-granularity problem, proposing a fully automated data expansion process, and demonstrating model design principles for the joint use of multimodal information. Practical applications: Smart photo albums (fine-grained photo counting), industrial quality inspection (counting specific defects), medical imaging (cell/lesion counting), autonomous driving (scene object understanding), etc.

Section 07

Limitations and Future Directions: Room for Continuous Optimization

Current limitations: KubriCount is based on synthetic data, which has a gap with the real world; the five-level granularity system may not cover all scenarios; the computational cost is relatively high. Future directions: Expand real-world data, dynamic granularity learning, cross-modal expansion (video/3D), efficiency optimization.