Zing Forum

CVPR 2026 Open Source: STC Framework Accelerates Streaming Video Large Models, Reducing ViT Encoding Latency by 24.5%

The EPIC Lab team from Shanghai Jiao Tong University has open-sourced the STC framework. Using hierarchical token compression, it cuts the ViT encoding latency of streaming video understanding models by 24.5% and LLM pre-filling latency by 45.3% while retaining about 99% of baseline accuracy.

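Tags: CVPR 2026 · Video Large Models · Streaming Video · Token Compression · ViT Acceleration · LLM Inference Optimization · Shanghai Jiao Tong University · Open-Source Framework · Video Understanding · Real-Time AI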
Published 2026-06-04 23:44 · Recent activity 2026-06-04 23:52 · Estimated read 6 min

STC has been accepted to CVPR 2026 and is fully open source. It targets real-time video scenarios such as live streaming, AR glasses, and surveillance.

Section 02

Research Background and Challenges

Streaming video understanding must process continuous frames in real time, so it is highly sensitive to latency. Existing video large models face three major bottlenecks: (1) every frame incurs a high ViT encoding cost; (2) long video token sequences make LLM pre-filling slow; (3) the temporal redundancy between consecutive frames goes largely unexploited. Together, these bottlenecks motivate more efficient processing methods.

Section 03

Core Design of the STC Framework

STC (Streaming Token Compression) adopts a hierarchical token compression strategy, including two modules:

  • STC-Cacher: Detects differences between frames, reuses cached tokens for static regions, and encodes only the regions that changed, reducing ViT encoding latency by 24.5%;
  • STC-Pruner: Shortens the token sequence via spatiotemporal saliency pruning, reducing LLM pre-filling latency by 45.3% while retaining about 99% of baseline accuracy.
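
The caching idea behind STC-Cacher can be sketched roughly as follows. This is a minimal illustration, not the STC implementation: `PatchTokenCache`, `mean_abs_diff`, the per-patch mean-absolute-difference test, and the `threshold` value are all assumptions made here. It also treats each patch as independently encodable, whereas a real ViT mixes information across patches through attention.

```python
from typing import Callable, List, Sequence

Patch = Sequence[float]  # flattened pixel values of one image patch
Token = List[float]      # encoder output for one patch


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Average absolute per-pixel difference between two equally sized patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class PatchTokenCache:
    """Hypothetical cache: re-encode a patch only when it has visibly changed."""

    def __init__(self, encode_patch: Callable[[Patch], Token], threshold: float = 0.05):
        self.encode_patch = encode_patch  # stand-in for the expensive ViT call
        self.threshold = threshold        # assumed change threshold
        self.ref_patches: List[Patch] = []  # patches as they looked when last encoded
        self.tokens: List[Token] = []       # tokens produced for those patches
        self.encoded_count = 0              # how many patch encodings were actually run

    def encode_frame(self, patches: Sequence[Patch]) -> List[Token]:
        # A change in patch count invalidates the cache.
        if len(patches) != len(self.ref_patches):
            self.ref_patches, self.tokens = [], []
        first_frame = not self.ref_patches

        out: List[Token] = []
        refs: List[Patch] = []
        for i, patch in enumerate(patches):
            is_static = (
                not first_frame
                and mean_abs_diff(patch, self.ref_patches[i]) <= self.threshold
            )
            if is_static:
                # Static region: reuse the cached token and keep the old reference,
                # so slow drift cannot accumulate unnoticed.
                out.append(self.tokens[i])
                refs.append(self.ref_patches[i])
            else:
                # Changed region: pay the encoding cost and refresh the cache entry.
                out.append(self.encode_patch(patch))
                refs.append(list(patch))
                self.encoded_count += 1

        self.ref_patches, self.tokens = refs, out
        return out
```

With a toy encoder such as `lambda p: [sum(p)]`, feeding two frames that differ in only one patch triggers a single extra encoding for the second frame; the remaining tokens come from the cache.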

Section 04

Experimental Results and Performance

Benchmark Tests

Results on the ReKV framework:

Method          ViT Encoding Latency   LLM Pre-filling Latency   Accuracy Change
ReKV Baseline   103.7s                 482.4s                    -
+ STC           78.3s (↓24.5%)         263.7s (↓45.3%)           Slight decrease only

Cross-Framework Versatility

STC can also be integrated into other frameworks such as Dispider and LiveCC. On StreamForest, for example, ViT encoding latency dropped from 103.7s to 67.7s (↓34.7%).
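
The reduction percentages quoted in this section can be checked directly from the reported latencies. The helper below is only a worked-arithmetic sketch; `relative_reduction` is a name introduced here for illustration, and the inputs are the figures stated above.

```python
def relative_reduction(baseline_s: float, optimized_s: float) -> float:
    """Return the latency reduction as a percentage of the baseline."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


# Latencies in seconds, as reported in this section.
reported = {
    "ReKV ViT encoding": (103.7, 78.3),
    "ReKV LLM pre-filling": (482.4, 263.7),
    "StreamForest ViT encoding": (103.7, 67.7),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% lower")
```

Rounded to one decimal place, the three results are 24.5%, 45.3%, and 34.7%, which match the figures reported in the text.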

Section 05

Technical Implementation Details

  • HuggingFace Integration: A one-line monkey patch attaches STC to CLIP/SigLIP vision towers, with no retraining required;
  • Parameter Configuration: Environment variables (e.g., STC_TOKEN_PER_FRAME) control the trade-off between latency and accuracy;
  • Plug-and-Play: STC-Pruner runs after ViT encoding and before LLM pre-filling, which lets it adapt to different Transformer backends.
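
As a rough illustration of the three points above, the sketch below monkey-patches a stand-in vision tower so that its output is pruned before it would reach the LLM, with the budget read from `STC_TOKEN_PER_FRAME`. Only the environment variable name comes from the text. `DummyVisionTower`, `attach_pruner`, `prune_tokens`, the saliency score, and the reading of the variable as "tokens kept per frame" are assumptions made for illustration, not the STC package's actual API or algorithm.

```python
import math
import os
from typing import List, Optional

Token = List[float]


def token_budget(default: int = 2) -> int:
    # Assumption: STC_TOKEN_PER_FRAME is the number of tokens kept per frame.
    return max(0, int(os.environ.get("STC_TOKEN_PER_FRAME", default)))


def distance(a: Token, b: Token) -> float:
    """Euclidean distance between two tokens of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prune_tokens(tokens: List[Token], prev: Optional[List[Token]], budget: int) -> List[Token]:
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    if not tokens:
        return []
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    use_prev = prev is not None and len(prev) == len(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        spatial = distance(tok, mean)                         # stands out within the frame
        temporal = distance(tok, prev[i]) if use_prev else 0.0  # changed since last frame
        scores.append(spatial + temporal)  # hypothetical spatiotemporal saliency
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    return [tokens[i] for i in sorted(keep)]


class DummyVisionTower:
    """Stand-in for a CLIP/SigLIP vision tower; emits one toy token per patch."""

    def forward(self, patches: List[List[float]]) -> List[Token]:
        return [[sum(p), max(p)] for p in patches]


def attach_pruner(tower) -> None:
    """Monkey-patch `tower.forward` so pruning happens right after encoding."""
    original_forward = tower.forward
    state = {"prev": None}  # unpruned tokens of the previous frame

    def forward_with_pruning(patches):
        tokens = original_forward(patches)
        pruned = prune_tokens(tokens, state["prev"], token_budget())
        state["prev"] = tokens
        return pruned

    tower.forward = forward_with_pruning
```

After `attach_pruner(tower)`, every call to `tower.forward(...)` returns at most `STC_TOKEN_PER_FRAME` tokens, so downstream code needs no changes. Because the budget is read on each call, changing the environment variable takes effect on the next frame.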

Section 06

Open Source Ecosystem and Usage

  • Code Structure: The Python package stc includes core code, benchmark tests, and documentation;
  • Installation: pip install -e . (core package) or pip install -e ".[hf]" (with HuggingFace integration; the quotes keep shells such as zsh from expanding the brackets);
  • Reproducibility: Detailed guides are provided for mainstream frameworks (ReKV, StreamForest, etc.).

Section 07

Research Significance and Outlook

  • Contributions: Presented as the first token compression framework designed for streaming scenarios, addressing a gap in the field;
  • Applications: Targets real-time scenarios such as intelligent security, autonomous driving, and AR/VR;
  • Future Work: Develop LiveCC support and further optimize the compression algorithms.

Section 08

Summary

Through hierarchical token compression, the STC framework cuts ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% on ReKV while keeping accuracy close to the baseline. It is plug-and-play and works across frameworks. The open-source code and documentation give researchers and practitioners a practical way to speed up real-time video AI applications.