StreamDyCoke: Enabling True Real-Time Streaming Inference for Video Large Language Models

StreamDyCoke is a streaming extension of the CVPR 2025 paper DyCoke. It combines causal sliding-window temporal token merging with a bounded dynamic pruning cache so that video large language models (Video LLMs) can run inference on live streams in real time, targeting applications such as AR glasses, robot perception, and assistive vision.

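Tags: Video LLM, video large language models, token compression, streaming inference, real-time AI, DyCoke, attention mechanism, cache strategy, computer vision, efficient inference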

Published 2026-04-30 04:11 | Recent activity 2026-04-30 04:23 | Estimated read 7 min

Section 01

Introduction: StreamDyCoke, a Key Breakthrough Enabling Real-Time Streaming Inference for Video Large Language Models

StreamDyCoke is a streaming extension of the CVPR 2025 paper DyCoke. Existing Video LLMs must process an entire video offline; StreamDyCoke removes that requirement with two core techniques, causal sliding-window temporal token merging and a bounded dynamic pruning cache, so that inference can run on a live stream. It targets real-time applications such as AR glasses, robot perception, and assistive vision, and opens a path toward practical deployment of video large models.

Section 02

Technical Background: Core Challenges in Real-Time Video Large Models

Explosive Growth of Video Data

Video adds a time dimension, so token counts grow quickly: a 1-minute clip at 30 fps contains 1,800 frames, each encoded as hundreds of tokens. This causes two major problems:

Computational complexity: The cost of Transformer self-attention grows with the square of the token count, so computation rises sharply as tokens accumulate;
Memory usage: The KV cache grows without bound as the video gets longer and can eventually exhaust memory.
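
To make the scale concrete, the arithmetic above can be sketched in a few lines of Python. The value of 196 tokens per frame is an illustrative assumption standing in for "hundreds of tokens"; it is not a figure from the project.

```python
# Back-of-the-envelope token arithmetic for an uncompressed video stream.
# tokens_per_frame=196 is an illustrative assumption, not a project figure.

def frame_count(seconds: int, fps: int) -> int:
    return seconds * fps

def total_tokens(frames: int, tokens_per_frame: int) -> int:
    return frames * tokens_per_frame

def attention_pairs(tokens: int) -> int:
    # Full self-attention scores every token against every token.
    return tokens * tokens

frames = frame_count(seconds=60, fps=30)
tokens = total_tokens(frames, tokens_per_frame=196)

print(f"frames: {frames}")
print(f"tokens: {tokens}")
print(f"attention pairs: {attention_pairs(tokens)}")
# Doubling the video length doubles the tokens but quadruples the pairs.
print(attention_pairs(2 * tokens) // attention_pairs(tokens))
```

The last line shows why length alone is a problem: token count scales linearly with duration, while the number of attention pairs scales quadratically.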

Limitations of Existing Solutions

DyCoke performs well as an offline token compression method, but it assumes the entire video is available in advance, and its symmetric window design relies on future frames, making it unsuitable for real-time streaming scenarios.

Section 03

Core Innovations: Three Technical Breakthroughs Enabling Streaming Inference

Causal Sliding-Window Temporal Token Merging (Causal Sliding-Window TTM): Accesses only historical frames; tokens from each new frame are merged with tokens from past frames, which keeps the method usable on a live stream;
Bounded Dynamic Pruning Cache (Bounded DP Cache): Puts an upper limit on cache capacity and supports three eviction strategies: FIFO, LRR, and DECAY (DECAY retains high-priority tokens based on attention scores);
Anytime Answering: Can generate a partial answer at any frame boundary without re-running prefill, which meets the need for real-time feedback.
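
The causal constraint in the first item can be illustrated with a minimal sketch. This is a simplified stand-in, not the project's implementation: it assumes every frame has the same number of token vectors, compares each new token only with the token at the same position in recent past frames, and drops it when the cosine similarity exceeds a threshold. The class name `CausalMerger`, the window size, and the threshold are all hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CausalMerger:
    """Keeps only the tokens of a new frame that differ from recent past frames."""

    def __init__(self, window=3, threshold=0.95):
        # The window holds past frames only; future frames are never consulted.
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def push(self, frame):
        """frame: list of token vectors, one per spatial position.
        Returns the (position, token) pairs that survive merging."""
        kept = []
        for pos, tok in enumerate(frame):
            redundant = any(
                cosine(tok, past[pos]) >= self.threshold
                for past in self.history
            )
            if not redundant:
                kept.append((pos, tok))
        self.history.append(frame)  # the oldest frame falls out automatically
        return kept


merger = CausalMerger(window=2, threshold=0.95)
f0 = [[1.0, 0.0], [0.0, 1.0]]
f1 = [[1.0, 0.0], [1.0, 1.0]]  # position 0 unchanged, position 1 changed

first = merger.push(f0)   # no history yet, so both tokens are kept
second = merger.push(f1)  # only the changed token at position 1 is kept
print(len(first), second)
```

Because `push` handles one frame at a time and looks only backward, the compressed tokens are available as soon as each frame arrives, which is the property a symmetric window cannot provide.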

Section 04

Experimental Evidence: DECAY Strategy Excels in Token Quality

Experiments on a 32-frame synthetic video stream (cache capacity 64 and active capacity 24, among other settings) show:

The TTM compression rate is the same (74%) regardless of cache strategy;
DECAY keeps higher-scoring tokens for longer: the average attention score is 0.83 (vs. 0.50 for FIFO/LRR), and tokens survive for 5.25 frames on average (vs. 2.65 for FIFO/LRR);
Strategy trade-off: DECAY yields higher token quality, while FIFO and LRR are simpler to implement and have lower overhead.
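
The trade-off between eviction strategies can be illustrated with a toy bounded cache. This is a sketch under stated assumptions, not the project's code: the `BoundedCache` class, the per-frame decay factor of 0.9, and the rule "evict the lowest decayed score" are hypothetical readings of the DECAY description. LRR is omitted because the source does not define it. The toy stream does not reproduce the reported numbers; it only shows the mechanism.

```python
from collections import OrderedDict


class BoundedCache:
    """Size-capped token cache mapping token_id -> priority score."""

    def __init__(self, capacity, policy="FIFO", decay=0.9):
        if policy not in ("FIFO", "DECAY"):
            raise ValueError(f"unsupported policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.decay = decay            # hypothetical per-frame decay factor
        self.entries = OrderedDict()  # insertion order doubles as arrival order

    def step(self, frame_scores):
        """Insert one frame's tokens, given as {token_id: attention_score}."""
        if self.policy == "DECAY":
            # Age the scores of tokens already in the cache.
            for token_id in list(self.entries):
                self.entries[token_id] *= self.decay
        self.entries.update(frame_scores)
        while len(self.entries) > self.capacity:
            self._evict()

    def _evict(self):
        if self.policy == "FIFO":
            self.entries.popitem(last=False)  # drop the oldest arrival
        else:
            victim = min(self.entries, key=self.entries.get)
            del self.entries[victim]          # drop the lowest decayed score


stream = [{"a": 0.9}, {"b": 0.2}, {"c": 0.5}]
fifo, decayed = BoundedCache(2, "FIFO"), BoundedCache(2, "DECAY")
for frame_scores in stream:
    fifo.step(frame_scores)
    decayed.step(frame_scores)

print(list(fifo.entries))     # ['b', 'c']
print(list(decayed.entries))  # ['a', 'c']
```

FIFO evicts the high-scoring token `a` simply because it arrived first, while the score-based policy keeps it and drops the low-scoring `b`. The cost is the extra work per frame of aging scores and searching for the minimum, which matches the overhead trade-off noted above.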

Section 05

Application Scenarios: Real-Time Streaming Inference Empowers Multi-Domain Applications

StreamDyCoke's technical breakthroughs bring new possibilities to the following fields:

Assistive Vision: Real-time environment description on devices for visually impaired users;
Robot Perception: Real-time scene understanding for autonomous robots during movement;
AR Glasses: Timely overlay of digital information in response to environmental changes;
Video Surveillance: Reducing computational costs for real-time analysis;
Remote Operation: Low-latency image understanding for remote surgery/driving.

Section 06

Project Background and Future Plans

Project Background

StreamDyCoke is a course project for ITCS 6010/8010 at the University of North Carolina at Charlotte. It carries a published research method toward an engineering implementation and follows open science principles: the code and data are public.

Future Roadmap

Short-term (completed): Core functions such as causal TTM, the bounded cache, and the streaming loop;
Mid-term: Reproduce the DyCoke baseline on LLaVA-OneVision-7B and run a streaming evaluation on the Ego4D-QA dataset;
Long-term: Run ablation experiments with real attention data and complete the final research report.

Section 07

Technical Insights and Industry Significance: A Key Step Toward Real-Time Video Intelligence

Technical Insights

Algorithm adaptation: Published algorithms often need modification before they fit practical scenarios (e.g., replacing a symmetric window with a causal one);
Cache optimization: Score-aware eviction strategies such as DECAY can improve what a cache retains when resources are constrained;
General value: Techniques such as sliding windows and bounded caches can be applied to other streaming AI systems.

Conclusion

StreamDyCoke marks a step in moving video large models from offline to real-time operation, and it gives developers and researchers an example of efficient AI system design. We look forward to its full evaluation on real Video LLMs and to the applications built on it.