STC: CVPR 2026 Accelerator Framework for Streaming Video Large Language Models, Enabling Real-Time Inference via Hierarchical Token Compression

The STC framework, proposed by the EPIC Lab at Shanghai Jiao Tong University, provides plug-and-play acceleration for streaming video large language models through hierarchical token compression. On the ReKV framework it reduces ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% while retaining up to 99% of the original accuracy, and it has been accepted to CVPR 2026.

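Tags: CVPR 2026, Video Large Language Models, Streaming Video, Token Compression, Inference Acceleration, Shanghai Jiao Tong University, Computer Vision, Deep Learning, GitHub, Open Source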

Published: 2026-06-04 23:44 | Recent activity: 2026-06-04 23:50 | Estimated read: 8 min

Section 02

Original Authors and Source

Original Author/Maintainer: EPIC Lab, SJTU (Shanghai Jiao Tong University Intelligent Computing Laboratory)
Source Platform: GitHub
Original Title: STC: Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Original Link: https://github.com/lern-to-write/STC
Release Date: June 4, 2026
Paper Link: https://arxiv.org/abs/2512.00891

Section 03

Research Background and Problem Definition

Video Large Language Models (Video LLMs) are developing rapidly, but streaming video understanding remains a severe performance challenge. In applications such as live streaming, AR glasses, and long-term surveillance, video frames arrive continuously. Traditional methods run full visual encoding and LLM pre-filling for every frame, so latency accumulates and real-time requirements become hard to meet.

Core difficulties of streaming video understanding:

Computational Redundancy: Adjacent video frames are highly redundant in time, so fully re-encoding every frame wastes computation
Sequence Inflation: Long video sequences produce a very large number of tokens, and LLM pre-filling time grows with sequence length
Real-Time Constraints: Latency-sensitive scenarios require millisecond-level responses, which makes traditional batch processing unsuitable
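To make the accumulated-latency problem concrete, here is a minimal sketch of frames being processed strictly one after another. The function name and all timing values are hypothetical illustrations, not measurements from the paper. When per-frame processing takes longer than the interval between frames, the lag grows with every frame; when it is shorter, the lag stays constant.

```python
from typing import List


def per_frame_lag(num_frames: int, arrival_interval: float,
                  processing_time: float) -> List[float]:
    """Return the lag (finish time minus arrival time) of each frame when
    frames are processed one at a time, in arrival order."""
    lags = []
    previous_finish = 0.0
    for i in range(num_frames):
        arrival = i * arrival_interval
        # A frame cannot start before it arrives or before the previous one is done.
        start = max(arrival, previous_finish)
        previous_finish = start + processing_time
        lags.append(previous_finish - arrival)
    return lags


# Processing slower than arrival: lag grows by 0.25 s per frame.
print(per_frame_lag(4, arrival_interval=0.5, processing_time=0.75))
# -> [0.75, 1.0, 1.25, 1.5]

# Processing faster than arrival: lag stays flat.
print(per_frame_lag(4, arrival_interval=0.5, processing_time=0.25))
# -> [0.25, 0.25, 0.25, 0.25]
```

Reducing the per-frame cost of encoding and pre-filling is what moves a system from the first regime to the second.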

Section 04

Core Innovations of STC

STC (Streaming Token Compression) is presented as the first plug-and-play inference acceleration framework for streaming video understanding, and it has been accepted to CVPR 2026. It consists of two core components:

Section 05

1. STC-Cacher: Intelligent Visual Token Caching

STC-Cacher leverages the temporal redundancy of video to selectively recompute only the dynamically changing visual tokens in each frame, while reusing the rest from the cache.

Technical Mechanism:

Compare the visual features of the current frame with those of a reference frame to identify the spatial regions that have changed
Re-encode only the tokens in the changed regions and reuse the cached tokens for static regions
Fully re-encode a new reference frame every N frames, trading some cache efficiency for a bound on accumulated drift
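The three steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the STC-Cacher implementation: the class name `ToyTokenCacher`, the cosine-similarity test, the 0.95 threshold, and the per-token `encode_token` function are all hypothetical. In particular, a real ViT mixes information across tokens through attention, so selective recomputation there is an approximation, whereas in this toy each token is encoded independently.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for zero vectors so they get recomputed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ToyTokenCacher:
    """Reuses cached tokens for patches that still match the reference frame."""

    def __init__(self, encode_token: Callable[[Vector], Vector],
                 refresh_every: int = 4, sim_threshold: float = 0.95) -> None:
        self.encode_token = encode_token    # hypothetical per-token encoder
        self.refresh_every = refresh_every  # "N" in the text
        self.sim_threshold = sim_threshold  # assumed change criterion
        self._ref: Optional[List[Vector]] = None
        self._cache: Optional[List[Vector]] = None
        self._frame_idx = 0

    def step(self, patches: List[Vector]) -> Tuple[List[Vector], int]:
        """Encode one frame; return (tokens, number of tokens recomputed)."""
        is_reference = self._ref is None or self._frame_idx % self.refresh_every == 0
        self._frame_idx += 1
        if is_reference:
            # Full re-encode: this frame becomes the new reference.
            self._ref = [list(p) for p in patches]
            self._cache = [self.encode_token(p) for p in patches]
            return list(self._cache), len(patches)
        tokens, recomputed = [], 0
        for patch, ref_patch, cached in zip(patches, self._ref, self._cache):
            if cosine_similarity(patch, ref_patch) < self.sim_threshold:
                tokens.append(self.encode_token(patch))  # changed region
                recomputed += 1
            else:
                tokens.append(cached)                    # static region
        return tokens, recomputed


def toy_encoder(patch: Vector) -> Vector:
    return [2.0 * x for x in patch]


frame0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
frame1 = [list(p) for p in frame0]
frame1[2] = [-1.0, 1.0]  # only one patch changes

cacher = ToyTokenCacher(toy_encoder, refresh_every=4)
_, n0 = cacher.step(frame0)
tokens1, n1 = cacher.step(frame1)
print(n0, n1)  # -> 4 1
```

In this sketch the second frame recomputes one token out of four. The refresh interval controls how long stale cached tokens can survive before a full re-encode replaces them.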

Performance Gain: On the ReKV framework, ViT encoding latency is reduced by 24.5% (from 103.7s to 78.3s)

Section 06

2. STC-Pruner: Hierarchical Token Compression

After visual encoding is completed, STC-Pruner compresses the token sequence to reduce the sequence length for LLM pre-filling while preserving spatiotemporal saliency.

Technical Features:

Selects the most important tokens based on visual saliency
Supports a configurable per-frame token budget (e.g., keeping 64 tokens out of a full 196)
Works together with STC-Cacher to form a hierarchical compression pipeline
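Budgeted token selection can be sketched as a top-k filter. This is a minimal illustration under assumptions, not the STC-Pruner algorithm: the function name `prune_tokens` is hypothetical, and the saliency score used here (Euclidean distance of each token from the frame's mean token) is a stand-in, since the article does not specify how saliency is computed.

```python
import math
from typing import List, Sequence


def prune_tokens(tokens: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return the indices of at most `budget` tokens to keep, in original order.

    Saliency here is an assumed proxy: tokens far from the frame's mean
    token are treated as more informative.
    """
    n = len(tokens)
    if budget >= n:
        return list(range(n))
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    scores = [math.dist(t, mean) for t in tokens]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)  # keep spatial order for the LLM


tokens = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [0.0, 0.0], [-10.0, 5.0]]
print(prune_tokens(tokens, budget=2))  # -> [2, 4]
```

With the budget from the example in the text, keeping 64 of 196 tokens leaves roughly a third of the per-frame sequence for pre-filling.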

Performance Gain: On the ReKV framework, LLM pre-filling latency is reduced by 45.3% (from 482.4s to 263.7s) while retaining up to 99% of the original accuracy

Section 07

Framework Compatibility and Integration

STC is designed as a model-agnostic core component that can be quickly integrated into mainstream streaming VideoLLM frameworks:

Framework	Vision Tower	STC-Cacher	STC-Pruner	Status
ReKV	SigLIP (LLaVA-OneVision)	✅	✅	Reference Implementation
StreamForest	SigLIP	✅	—	Per-frame Streaming Cache
Dispider	CLIP	✅	—	Per-frame Streaming Cache
LiveCC	—	🔜	🔜	Integration in Progress

Integration Methods:

STC-Cacher attaches to any HuggingFace pre-LN CLIP/SigLIP vision tower with a one-line monkey patch
STC-Pruner is called explicitly to compress tokens before LLM pre-filling
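For readers unfamiliar with the monkey-patch pattern, the sketch below shows the general idea on a stand-in class. It does not use the STC or HuggingFace APIs: `StandInVisionTower`, `attach_frame_cache`, and the exact-match caching rule are all hypothetical, chosen only to show how an encoder's `forward` can be wrapped without modifying its source.

```python
from typing import List, Optional


class StandInVisionTower:
    """Hypothetical stand-in for a vision encoder (not a HuggingFace class)."""

    def __init__(self) -> None:
        self.calls = 0

    def forward(self, frame: List[float]) -> List[float]:
        self.calls += 1
        return [2.0 * x for x in frame]


def attach_frame_cache(tower: StandInVisionTower) -> None:
    """Replace `tower.forward` with a wrapper that reuses the previous output
    when the incoming frame is identical to the previous frame."""
    original_forward = tower.forward  # keep a handle on the bound method
    last_frame: Optional[List[float]] = None
    last_output: Optional[List[float]] = None

    def cached_forward(frame: List[float]) -> List[float]:
        nonlocal last_frame, last_output
        if last_frame is not None and frame == last_frame:
            return last_output  # cache hit: skip the encoder
        last_output = original_forward(frame)
        last_frame = list(frame)
        return last_output

    tower.forward = cached_forward  # the patch itself is one assignment


tower = StandInVisionTower()
attach_frame_cache(tower)  # one line at the call site
tower.forward([1.0, 2.0])
tower.forward([1.0, 2.0])  # identical frame, served from cache
print(tower.calls)  # -> 1
```

The calling code keeps using `tower.forward` as before, which is what makes this style of integration plug-and-play.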

Section 08

Main Experimental Results (ReKV Framework)

On the OVO-Bench and StreamingBench benchmarks, STC stays closest to the uncompressed ReKV baseline in accuracy among the compared compression methods, at a similar reduction in LLM pre-filling latency:

Method	OVO Real-Time	OVO Retrospective	OVO Prospective	StreamingBench	ViT Encoding Latency	LLM Pre-filling Latency
ReKV Baseline	64.4	64.6	52.6	69.1	103.7s	482.4s
+ ToMe	53.1	60.7	46.4	59.4	70.5s (↓32%)	257.8s (↓46.6%)
+ VisionZip	53.8	58.4	47.5	60.4	103.7s	258.3s (↓46.5%)
+ VidCom²	60.4	59.0	50.4	63.6	103.7s	259.1s (↓46.3%)
+ STC	62.5	63.3	52.0	65.2	78.3s (↓24.5%)	263.7s (↓45.3%)

Key Findings:

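STC cuts ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% while retaining up to 99% of the baseline's accuracy
Compared with VidCom², the reported gain is 1.6 points on both OVO-Bench and StreamingBench (StreamingBench: 65.2 vs. 63.6)
Compared with ToMe, the reported gains are 5.6 points on OVO-Bench and 5.8 points on StreamingBench (StreamingBench: 65.2 vs. 59.4)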
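The latency and accuracy figures can be checked directly from the table. The short script below recomputes the percentage reductions and the share of baseline accuracy that STC retains. The helper names are illustrative, and the combined figure assumes the two stages simply add up, which the article does not state.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` versus `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def percent_retained(baseline: float, value: float) -> float:
    """Share of the baseline score that is kept, in percent."""
    return 100.0 * value / baseline


# Accuracy scores from the table (ReKV baseline vs. + STC).
baseline = {"OVO Real-Time": 64.4, "OVO Retrospective": 64.6,
            "OVO Prospective": 52.6, "StreamingBench": 69.1}
stc = {"OVO Real-Time": 62.5, "OVO Retrospective": 63.3,
       "OVO Prospective": 52.0, "StreamingBench": 65.2}

vit = percent_reduction(103.7, 78.3)
prefill = percent_reduction(482.4, 263.7)
# Assumption: total cost is the plain sum of the two stages.
combined = percent_reduction(103.7 + 482.4, 78.3 + 263.7)

print(f"ViT encoding: {vit:.1f}%")       # -> ViT encoding: 24.5%
print(f"LLM pre-filling: {prefill:.1f}%")  # -> LLM pre-filling: 45.3%
print(f"Combined: {combined:.1f}%")      # -> Combined: 41.6%

for name in baseline:
    print(f"{name}: {percent_retained(baseline[name], stc[name]):.1f}% retained")
# -> OVO Real-Time: 97.0% retained
# -> OVO Retrospective: 98.0% retained
# -> OVO Prospective: 98.9% retained
# -> StreamingBench: 94.4% retained
```

Retention ranges from 94.4% to 98.9% across the four columns, so the "up to 99%" claim corresponds to the best case (OVO Prospective) rather than to every benchmark.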
