# CVPR 2026 Open Source: STC Framework Accelerates Streaming Video Large Models, Reducing ViT Encoding Latency by 24.5%

> The EPIC Lab team at Shanghai Jiao Tong University has open-sourced the STC framework. Using hierarchical token compression, it reduces the ViT encoding latency of streaming video understanding models by 24.5% and LLM pre-filling latency by 45.3% while retaining about 99% of baseline accuracy.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T15:44:54.000Z
- Last activity: 2026-06-04T15:52:14.351Z
- Popularity: 163.9
- Keywords: CVPR 2026, video large models, streaming video, token compression, ViT acceleration, LLM inference optimization, Shanghai Jiao Tong University, open-source framework, video understanding, real-time AI
- Page URL: https://www.zingnex.cn/en/forum/thread/cvpr-2026-stc-vit24-5
- Canonical: https://www.zingnex.cn/forum/thread/cvpr-2026-stc-vit24-5
- Markdown source: floors_fallback

---

STC has been accepted to CVPR 2026 and is fully open source. It targets real-time video scenarios such as live streaming, AR glasses, and surveillance.

## Research Background and Challenges

Streaming video understanding must process continuous frames in real time, so it is highly sensitive to latency. Existing video large models face three major bottlenecks:

1. ViT encoding incurs a high computational cost for every frame;
2. Long video token sequences make LLM pre-filling slow;
3. Temporal redundancy between consecutive frames goes largely unexploited.

These bottlenecks motivate a more efficient processing pipeline.

## Core Design of the STC Framework

STC (Streaming Token Compression) uses a hierarchical token compression strategy built from two modules:
- **STC-Cacher**: Detects differences between frames, reuses cached tokens for static regions, and encodes only the regions that changed, which reduces ViT encoding latency by 24.5%;
- **STC-Pruner**: Compresses the token sequence by pruning on spatiotemporal saliency, which reduces LLM pre-filling latency by 45.3% while retaining about 99% of baseline accuracy.
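
The difference-detection idea behind STC-Cacher can be illustrated with a toy sketch. This is not the STC implementation: `PatchTokenCache`, `mean_abs_diff`, the `threshold` value, and the per-patch `encode_patch` callback are hypothetical names chosen for illustration. It also treats every patch as independently encodable, which a real ViT with global attention does not allow without further care.

```python
from typing import Callable, List, Optional, Sequence

Patch = Sequence[float]   # flattened pixel values of one image patch
Token = List[float]       # embedding produced for one patch


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Average absolute per-pixel difference between two patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class PatchTokenCache:
    """Re-encodes only patches that drifted from their cached reference."""

    def __init__(self, encode_patch: Callable[[Patch], Token],
                 threshold: float = 0.05):  # threshold value is an assumption
        self.encode_patch = encode_patch    # stand-in for the expensive ViT work
        self.threshold = threshold
        self.ref_patches: List[Optional[Patch]] = []
        self.tokens: List[Token] = []
        self.encoded_count = 0              # patches actually encoded so far

    def encode_frame(self, patches: List[Patch]) -> List[Token]:
        if len(self.ref_patches) != len(patches):
            # First frame (or a resolution change): nothing can be reused.
            self.ref_patches = [None] * len(patches)
            self.tokens = [[] for _ in patches]
        for i, patch in enumerate(patches):
            ref = self.ref_patches[i]
            if ref is not None and mean_abs_diff(patch, ref) < self.threshold:
                continue                    # static region: keep the cached token
            self.tokens[i] = self.encode_patch(patch)
            self.ref_patches[i] = list(patch)
            self.encoded_count += 1
        return list(self.tokens)


def toy_encoder(patch: Patch) -> Token:
    """Toy 'embedding': the mean pixel value of the patch."""
    return [sum(patch) / len(patch)]


cache = PatchTokenCache(toy_encoder)
cache.encode_frame([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])           # encodes 3 patches
tokens = cache.encode_frame([[0.0, 0.0], [0.2, 0.2], [0.5, 0.5]])  # encodes 1 patch
print(tokens, cache.encoded_count)
```

In the second frame only the middle patch changes, so the sketch performs four patch encodings in total instead of six. The reference patch is refreshed only when a patch is re-encoded, so slow drift accumulates against the cached reference rather than slipping under the threshold one frame at a time.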

## Experimental Results and Performance

### Benchmark Tests
Results on the ReKV framework:

| Method | ViT Encoding Latency (s) | LLM Pre-filling Latency (s) | Accuracy Change |
|--------|--------------------------|-----------------------------|-----------------|
| ReKV Baseline | 103.7 | 482.4 | - |
| + STC | 78.3 (↓24.5%) | 263.7 (↓45.3%) | Slight decrease only |

### Cross-Framework Versatility
STC is not tied to ReKV: it can also be integrated into frameworks such as Dispider, LiveCC, and StreamForest. On StreamForest, for example, ViT encoding latency drops from 103.7s to 67.7s (↓34.7%).
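
The percentages reported above follow directly from the raw latencies. The short check below recomputes them; the helper name `relative_reduction` is illustrative, and the combined figure assumes that the ViT and pre-filling stages simply add up, which the source does not state.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Percentage by which `optimized` is lower than `baseline`."""
    return (baseline - optimized) / baseline * 100.0


# (baseline seconds, seconds with STC), taken from the figures reported above
measurements = {
    "ReKV ViT encoding": (103.7, 78.3),
    "ReKV LLM pre-filling": (482.4, 263.7),
    "StreamForest ViT encoding": (103.7, 67.7),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% lower")

# Assumption: the two ReKV stages run back to back, so their latencies add.
rekv_before = 103.7 + 482.4
rekv_after = 78.3 + 263.7
print(f"ReKV combined: {relative_reduction(rekv_before, rekv_after):.1f}% lower")
```

Under that additive assumption, the combined ReKV encoding and pre-filling time falls by roughly 41.6%.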

## Technical Implementation Details

- **HuggingFace Integration**: A one-line monkey patch attaches STC to CLIP/SigLIP vision towers, with no retraining required;
- **Parameter Configuration**: Environment variables (e.g., `STC_TOKEN_PER_FRAME`) control the trade-off between latency and accuracy;
- **Plug-and-Play**: STC-Pruner runs after ViT encoding and before LLM pre-filling, so it adapts to Transformer-based backends.
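
To make the three points above concrete, here is a minimal sketch of how a pruning step could be monkey-patched onto an encoder and steered by an environment variable. Only the variable name `STC_TOKEN_PER_FRAME` comes from the text above. The default budget of 64, the saliency formula, `ToyVisionTower`, and `attach_pruner` are assumptions made for illustration; they are not the `stc` package API and not a HuggingFace class.

```python
import os
from typing import List, Optional

Token = List[float]


def token_budget(default: int = 64) -> int:
    """Tokens kept per frame; the default of 64 is an assumed value."""
    return int(os.environ.get("STC_TOKEN_PER_FRAME", default))


def l2_distance(a: Token, b: Token) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def saliency(tokens: List[Token], prev: Optional[List[Token]]) -> List[float]:
    """Assumed score: distance from the frame mean (spatial) plus distance
    from the same position in the previous frame (temporal)."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    scores = []
    for i, tok in enumerate(tokens):
        temporal = l2_distance(tok, prev[i]) if prev else 0.0
        scores.append(l2_distance(tok, mean) + temporal)
    return scores


def prune_tokens(tokens: List[Token], prev: Optional[List[Token]],
                 budget: int) -> List[Token]:
    """Keep the `budget` most salient tokens, preserving their original order."""
    if budget >= len(tokens):
        return list(tokens)
    scores = saliency(tokens, prev)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:budget])]


class ToyVisionTower:
    """Stand-in encoder: emits one 1-D token per input value."""

    def forward(self, frame: List[float]) -> List[Token]:
        return [[float(v)] for v in frame]


def attach_pruner(tower_cls) -> None:
    """Monkey-patch `forward` so pruning runs after encoding, before the LLM."""
    original_forward = tower_cls.forward

    def patched_forward(self, frame):
        tokens = original_forward(self, frame)
        prev = getattr(self, "_prev_tokens", None)
        self._prev_tokens = tokens          # remember unpruned tokens for next frame
        return prune_tokens(tokens, prev, token_budget())

    tower_cls.forward = patched_forward


attach_pruner(ToyVisionTower)               # the "one-line" attachment
tower = ToyVisionTower()
print(len(tower.forward([0.0, 0.0, 1.0, 0.0])))
```

With `STC_TOKEN_PER_FRAME` unset, the toy frame has fewer tokens than the assumed default budget, so nothing is pruned. Setting the variable to a smaller number trades accuracy for a shorter sequence handed to the LLM.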

## Open Source Ecosystem and Usage

- **Code Structure**: The Python package `stc` includes core code, benchmark tests, and documentation;
- **Installation**: `pip install -e .` (core package) or `pip install -e ".[hf]"` (with HuggingFace integration; the quotes keep shells such as zsh from expanding the brackets);
- **Reproducibility**: Detailed reproduction guides are provided for mainstream frameworks (ReKV, StreamForest, etc.).

## Research Significance and Outlook

- **Contributions**: Described as the first token compression framework designed for streaming video scenarios, filling a gap in the field;
- **Applications**: Improves responsiveness in real-time scenarios such as intelligent security, autonomous driving, and AR/VR;
- **Future Work**: Develop LiveCC support and further optimize the algorithms for efficiency.

## Summary

Through hierarchical token compression, the STC framework cuts ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% on ReKV while retaining about 99% of baseline accuracy. It is plug-and-play and works across frameworks. The open-source code and documentation give researchers and practitioners a practical optimization path for real-time video AI applications.
