STC: CVPR 2026 Accelerator Framework for Streaming Video Large Language Models, Enabling Real-Time Inference via Hierarchical Token Compression

The STC framework, proposed by the EPIC Lab at Shanghai Jiao Tong University, provides plug-and-play acceleration for streaming video large language models through hierarchical token compression. It significantly reduces inference latency while retaining up to 99% of the original accuracy, and it has been accepted by CVPR 2026.

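CVPR 2026, Video Large Language Models, Streaming Video, Token Compression, Inference Acceleration, Shanghai Jiao Tong University, Computer Vision, Deep Learning, GitHub, Open Source
Published 2026-06-04 23:44 | Recent activity 2026-06-04 23:50 | Estimated read 8 min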
1

Section 01

Introduction / Main Floor

STC, proposed by the EPIC Lab at Shanghai Jiao Tong University and accepted by CVPR 2026, is a plug-and-play acceleration framework for streaming video large language models. It applies hierarchical token compression in two stages: STC-Cacher reuses cached visual tokens across frames, and STC-Pruner shortens the token sequence before LLM pre-filling. On the ReKV framework, this reduces ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% while retaining up to 99% of the original accuracy.

2

Section 02

Original Authors and Source

  • Original Author/Maintainer: EPIC Lab, SJTU (Shanghai Jiao Tong University Intelligent Computing Laboratory)
  • Source Platform: GitHub
  • Original Title: STC: Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
  • Original Link: https://github.com/lern-to-write/STC
  • Release Date: June 4, 2026
  • Paper Link: https://arxiv.org/abs/2512.00891

3

Section 03

Research Background and Problem Definition

Video Large Language Models (Video LLMs) are developing rapidly, but streaming video understanding poses severe performance challenges. In applications such as live streaming, AR glasses, and long-term surveillance, video frames arrive continuously. Traditional methods run full visual encoding and LLM pre-filling for every frame, so latency accumulates and real-time requirements become hard to meet.

Core difficulties of streaming video understanding:

  1. Computational Redundancy: Adjacent video frames are highly redundant in time, so fully re-encoding every frame wastes computation
  2. Sequence Inflation: Long videos produce very large token sequences, and LLM pre-filling time grows at least linearly with sequence length
  3. Real-Time Constraints: Latency-sensitive scenarios require millisecond-level responses, which makes traditional batch processing unsuitable

4

Section 04

Core Innovations of STC

STC (Streaming Token Compression) is presented by its authors as the first plug-and-play inference acceleration framework for streaming video understanding, and it has been accepted by CVPR 2026. It has two core components, described in the next two sections:

5

Section 05

1. STC-Cacher: Intelligent Visual Token Caching

STC-Cacher leverages the temporal redundancy of video to selectively recompute only the dynamically changing visual tokens in each frame, while reusing the rest from the cache.

Technical Mechanism:

  • By comparing the visual features of the current frame with the reference frame, identify the spatial regions that have changed
  • Re-encode only the tokens in the changed regions; directly reuse the cache for static regions
  • Set a complete reference frame every N frames to balance cache efficiency and drift accumulation
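The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the STC implementation: the class name `TokenCacheSketch`, the mean-absolute-difference change test, the threshold value, and the per-patch `encode_patch` callback are all assumptions made for the example. In particular, a real ViT encodes patches jointly through attention, so encoding each patch independently is a simplification.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Patch = Sequence[float]  # flattened values of one spatial patch (assumption)
Token = List[float]


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Average absolute difference between two equally sized patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class TokenCacheSketch:
    """Hypothetical cache: re-encode changed patches, reuse the rest."""

    def __init__(self, encode_patch: Callable[[Patch], Token],
                 threshold: float = 0.1, refresh_every: int = 8) -> None:
        self.encode_patch = encode_patch
        self.threshold = threshold          # change threshold (assumed value)
        self.refresh_every = refresh_every  # the "N" in "every N frames"
        self.ref_patches: Optional[List[Patch]] = None
        self.ref_tokens: List[Token] = []
        self.frame_index = 0

    def encode_frame(self, patches: List[Patch]) -> Tuple[List[Token], int]:
        """Return (tokens, number of patches re-encoded) for one frame."""
        is_reference = (self.ref_patches is None
                        or self.frame_index % self.refresh_every == 0)
        self.frame_index += 1
        if is_reference:
            # Full re-encode: this frame becomes the new reference.
            self.ref_patches = list(patches)
            self.ref_tokens = [self.encode_patch(p) for p in patches]
            return list(self.ref_tokens), len(patches)

        tokens: List[Token] = []
        recomputed = 0
        for patch, ref_patch, ref_token in zip(
                patches, self.ref_patches, self.ref_tokens):
            if mean_abs_diff(patch, ref_patch) > self.threshold:
                tokens.append(self.encode_patch(patch))  # changed region
                recomputed += 1
            else:
                tokens.append(ref_token)                 # static region
        return tokens, recomputed
```

The periodic full re-encode bounds how long a stale reference can be reused, which is the drift trade-off the last bullet refers to: a larger `refresh_every` saves more computation but lets cached tokens age further.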

Performance Gain: On the ReKV framework, ViT encoding latency is reduced by 24.5% (from 103.7s to 78.3s).

6

Section 06

2. STC-Pruner: Hierarchical Token Compression

After visual encoding is completed, STC-Pruner compresses the token sequence to reduce the sequence length for LLM pre-filling while preserving spatiotemporal saliency.

Technical Features:

  • Select the most important tokens based on visual saliency
  • Configurable per-frame token budget (e.g., keep 64 tokens instead of the full 196)
  • Works in collaboration with Cacher to form a hierarchical compression pipeline
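As a rough illustration of budgeted token selection, the sketch below keeps the highest-scoring tokens of one frame. The source does not specify how saliency is computed, so the score used here (Euclidean distance from the frame's mean token) is only a stand-in, and the function names `saliency_scores` and `prune_frame` are hypothetical.

```python
import math
from typing import List

Token = List[float]


def saliency_scores(tokens: List[Token]) -> List[float]:
    """Assumed saliency proxy: distance of each token from the frame mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [math.dist(t, mean) for t in tokens]


def prune_frame(tokens: List[Token], budget: int) -> List[Token]:
    """Keep the `budget` highest-scoring tokens, preserving their order."""
    if budget >= len(tokens):
        return list(tokens)
    scores = saliency_scores(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore original spatial order
    return [tokens[i] for i in keep]


# Example: a frame of 196 one-dimensional tokens reduced to a budget of 64.
frame = [[float(i % 7)] for i in range(196)]
compressed = prune_frame(frame, budget=64)
print(len(frame), "->", len(compressed))
```

Keeping the surviving tokens in their original order matters because the LLM still receives them as a sequence; only the sequence length changes, from 196 to 64 tokens per frame in this example.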

Performance Gain: On the ReKV framework, LLM pre-filling latency is reduced by 45.3% (from 482.4s to 263.7s) while retaining up to 99% of the original accuracy.


7

Section 07

Framework Compatibility and Integration

STC is designed as a model-agnostic core component that can be quickly integrated into mainstream streaming VideoLLM frameworks:

Framework     Vision Tower              STC-Cacher  STC-Pruner  Status
ReKV          SigLIP (LLaVA-OneVision)                          Reference Implementation
StreamForest  SigLIP                                            Per-frame Streaming Cache
Dispider      CLIP                                              Per-frame Streaming Cache
LiveCC                                  🔜          🔜          Integration in Progress

Integration Methods:

  • STC-Cacher attaches to any HuggingFace pre-LN CLIP/SigLIP vision tower through a one-line monkey patch
  • STC-Pruner is an explicit call that performs token compression before LLM pre-filling
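For readers unfamiliar with the pattern, the sketch below shows what attaching behaviour by monkey patch looks like in plain Python. It does not use the STC or HuggingFace APIs: `DummyVisionTower`, `attach_cache`, and the exact-match frame cache are all hypothetical stand-ins, and the cache here is far simpler than the token-level reuse STC-Cacher performs.

```python
import types


class DummyVisionTower:
    """Stand-in for a vision encoder; not a real HuggingFace class."""

    def forward(self, frame):
        return [float(x) * 2.0 for x in frame]


def attach_cache(tower):
    """Hypothetical patch: wrap `tower.forward` with a cache lookup."""
    original_forward = tower.forward  # bound method captured before patching
    cache = {}

    def cached_forward(self, frame):
        key = tuple(frame)
        if key not in cache:
            cache[key] = original_forward(frame)
        return cache[key]

    # Replace the method on this instance only; the class is left untouched.
    tower.forward = types.MethodType(cached_forward, tower)
    tower.cache_sketch = cache
    return tower


tower = attach_cache(DummyVisionTower())  # the "one line" at the call site
first = tower.forward([1, 2, 3])
second = tower.forward([1, 2, 3])         # served from the cache
```

The appeal of this approach is that the host framework's code does not change: the caller keeps invoking `forward` as before, which is what makes such a component plug-and-play.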

8

Section 08

Main Experimental Results (ReKV Framework)

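On the OVO-Bench and StreamingBench benchmarks, STC scores higher than the other compression methods on every accuracy column, stays within a few points of the uncompressed ReKV baseline, and cuts both ViT encoding and LLM pre-filling latency: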

Method         OVO Real-Time  OVO Retrospective  OVO Prospective  StreamingBench  ViT Encoding Latency  LLM Pre-filling Latency
ReKV Baseline  64.4           64.6               52.6             69.1            103.7s                482.4s
+ ToMe         53.1           60.7               46.4             59.4            70.5s (↓32%)          257.8s (↓46.6%)
+ VisionZip    53.8           58.4               47.5             60.4            103.7s                258.3s (↓46.5%)
+ VidCom²      60.4           59.0               50.4             63.6            103.7s                259.1s (↓46.3%)
+ STC          62.5           63.3               52.0             65.2            78.3s (↓24.5%)        263.7s (↓45.3%)

Key Findings:

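  • STC achieves significant acceleration while retaining up to 99% of the baseline accuracy (52.0 vs 52.6 on OVO Prospective)
  • Compared to VidCom², it scores 1.6 points higher on both OVO Prospective and StreamingBench, and 2.1 and 4.3 points higher on OVO Real-Time and OVO Retrospective
  • Compared to ToMe, it scores 5.6 points higher on OVO Prospective and 5.8 points higher on StreamingBench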
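The percentages in the table can be rechecked directly from the reported numbers. The short script below does this for the STC row. The helper names are hypothetical, and the combined figure assumes the two stages simply add up, which the source does not state.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return round(100.0 * (before - after) / before, 1)


def retention_pct(baseline: float, compressed: float) -> float:
    """Share of the baseline score that is retained, in percent."""
    return round(100.0 * compressed / baseline, 1)


# Values copied from the results table (seconds and benchmark scores).
rekv = {"vit_s": 103.7, "prefill_s": 482.4,
        "ovo_prospective": 52.6, "streamingbench": 69.1}
stc = {"vit_s": 78.3, "prefill_s": 263.7,
       "ovo_prospective": 52.0, "streamingbench": 65.2}

vit_cut = reduction_pct(rekv["vit_s"], stc["vit_s"])
prefill_cut = reduction_pct(rekv["prefill_s"], stc["prefill_s"])

# Assumption: treat the two stages as if they ran back to back.
combined_cut = reduction_pct(rekv["vit_s"] + rekv["prefill_s"],
                             stc["vit_s"] + stc["prefill_s"])

best_retention = retention_pct(rekv["ovo_prospective"], stc["ovo_prospective"])
sb_retention = retention_pct(rekv["streamingbench"], stc["streamingbench"])

print(vit_cut, prefill_cut, combined_cut, best_retention, sb_retention)
```

The retention figures show where the "up to 99%" claim comes from: it matches the OVO Prospective column, while the other columns retain somewhat less of the baseline score.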