# STC: CVPR 2026 Accelerator Framework for Streaming Video Large Language Models, Enabling Real-Time Inference via Hierarchical Token Compression

> The STC framework, proposed by the EPIC Lab at Shanghai Jiao Tong University, provides plug-and-play acceleration for streaming video large language models via hierarchical token compression. It significantly reduces inference latency while retaining up to 99% of the original accuracy, and it has been accepted by CVPR 2026.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-04T15:44:54.000Z
- Last activity: 2026-06-04T15:50:36.047Z
- Popularity: 163.9
- Keywords: CVPR 2026, video large language models, streaming video, token compression, inference acceleration, Shanghai Jiao Tong University, computer vision, deep learning, GitHub, open source
- Page link: https://www.zingnex.cn/en/forum/thread/stc-cvpr-2026-token
- Canonical: https://www.zingnex.cn/forum/thread/stc-cvpr-2026-token
- Markdown source: floors_fallback

---

## Original Authors and Source

- **Original Author/Maintainer**: EPIC Lab, SJTU (Shanghai Jiao Tong University Intelligent Computing Laboratory)
- **Source Platform**: GitHub
- **Original Title**: STC: Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
- **Original Link**: https://github.com/lern-to-write/STC
- **Release Date**: June 4, 2026
- **Paper Link**: https://arxiv.org/abs/2512.00891

---

## Research Background and Problem Definition

Video Large Language Models (Video LLMs) are developing rapidly, but streaming video understanding poses severe performance challenges. In applications such as live streaming, AR glasses, and long-term surveillance, video frames arrive continuously. Traditional methods run full visual encoding and LLM pre-filling for every frame, so latency accumulates and real-time requirements become hard to meet.

Core difficulties of streaming video understanding:
1. **Computational Redundancy**: There is significant temporal redundancy between adjacent video frames; fully re-encoding each frame is inefficient
2. **Sequence Inflation**: Long video sequences produce a huge number of tokens, and LLM pre-filling time grows at least linearly with sequence length
3. **Real-Time Constraints**: Latency-sensitive scenarios require millisecond-level responses, making traditional batch processing methods unsuitable

---

## Core Innovations of STC

STC (Streaming Token Compression) is the first plug-and-play inference acceleration framework for streaming video understanding, and it has been accepted by CVPR 2026. Its core innovations are two complementary components:

## 1. STC-Cacher: Intelligent Visual Token Caching

STC-Cacher leverages the temporal redundancy of video to selectively recompute only the dynamically changing visual tokens in each frame, while reusing the rest from the cache.

**Technical Mechanism**:
- Compare the visual features of the current frame with those of the reference frame to identify the spatial regions that have changed
- Re-encode only the tokens in the changed regions; reuse cached tokens directly for static regions
- Fully re-encode a new reference frame every N frames, trading some cache reuse for a bound on accumulated drift

**Performance Gain**: On the ReKV framework, ViT encoding latency is reduced by 24.5%
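
The caching idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the STC implementation: `TokenCache`, `encode_patch`, `refresh_every`, and `threshold` are hypothetical names, the change test is a cosine similarity on raw patch vectors, and each patch is encoded independently (a real ViT mixes patches through attention, so the actual method has to handle that).

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TokenCache:
    """Hypothetical per-patch token cache with a periodic full refresh."""

    def __init__(
        self,
        encode_patch: Callable[[Vector], Vector],
        refresh_every: int = 8,    # assumed value for "every N frames"
        threshold: float = 0.95,   # assumed similarity cut-off
    ) -> None:
        self.encode_patch = encode_patch
        self.refresh_every = refresh_every
        self.threshold = threshold
        self.ref_patches: Optional[List[Vector]] = None
        self.cached_tokens: Optional[List[Vector]] = None
        self.frame_index = 0

    def encode_frame(self, patches: Sequence[Vector]) -> Tuple[List[Vector], int]:
        """Return (tokens, number of patches that were actually re-encoded)."""
        full_refresh = (
            self.ref_patches is None
            or self.frame_index % self.refresh_every == 0
        )
        self.frame_index += 1

        if full_refresh:
            # Reference frame: encode every patch and reset the cache.
            self.ref_patches = [list(p) for p in patches]
            self.cached_tokens = [self.encode_patch(p) for p in patches]
            return list(self.cached_tokens), len(patches)

        recomputed = 0
        for i, patch in enumerate(patches):
            if cosine_similarity(patch, self.ref_patches[i]) < self.threshold:
                # Changed region: re-encode and keep the reference in sync.
                self.cached_tokens[i] = self.encode_patch(patch)
                self.ref_patches[i] = list(patch)
                recomputed += 1
            # Static region: the cached token is reused unchanged.
        return list(self.cached_tokens), recomputed
```

In this sketch the saving comes from the second return value: on non-reference frames only the patches that fall below the similarity threshold reach `encode_patch`.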

## 2. STC-Pruner: Hierarchical Token Compression

After visual encoding is completed, STC-Pruner compresses the token sequence to reduce the sequence length for LLM pre-filling while preserving spatiotemporal saliency.

**Technical Features**:
- Select the most important tokens based on visual saliency
- Configurable per-frame token budget (e.g., keeping 64 tokens instead of the full 196 per frame)
- Works in collaboration with Cacher to form a hierarchical compression pipeline

**Performance Gain**: On the ReKV framework, LLM pre-filling latency is reduced by 45.3% while maintaining up to 99% of the original accuracy
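
A per-frame token budget can be illustrated with a simple top-k selection. The sketch below is an assumption-laden example rather than STC-Pruner itself: the source does not define its saliency score, so `deviation_saliency` (distance from the frame's mean token) is only a stand-in, and `prune_tokens` is a hypothetical name.

```python
from typing import List, Sequence


def deviation_saliency(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in saliency: squared distance of each token from the frame mean.

    This scoring rule is an assumption for illustration only.
    """
    if not tokens:
        return []
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [sum((t[d] - mean[d]) ** 2 for d in range(dim)) for t in tokens]


def prune_tokens(saliency: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` most salient tokens, in original order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if budget >= len(saliency):
        return list(range(len(saliency)))
    # Rank token indices by score, highest first, and keep the top `budget`.
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    # Re-sort the survivors so the kept tokens preserve their spatial order.
    return sorted(ranked[:budget])


# Usage: keep 64 of 196 tokens for one frame (toy 2-dimensional tokens).
frame_tokens = [[float(i % 7), float(i % 5)] for i in range(196)]
kept = prune_tokens(deviation_saliency(frame_tokens), budget=64)
pruned_tokens = [frame_tokens[i] for i in kept]
```

Returning indices rather than tokens keeps the positions of the surviving tokens available, which matters if the downstream model relies on positional information.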

---

## Framework Compatibility and Integration

STC is designed as a model-agnostic core component that can be quickly integrated into mainstream streaming VideoLLM frameworks:

| Framework | Vision Tower | STC-Cacher | STC-Pruner | Status |
|------|--------|------------|------------|------|
| ReKV | SigLIP (LLaVA-OneVision) | ✅ | ✅ | Reference Implementation |
| StreamForest | SigLIP | ✅ | — | Per-frame Streaming Cache |
| Dispider | CLIP | ✅ | — | Per-frame Streaming Cache |
| LiveCC | — | 🔜 | 🔜 | Integration in Progress |

**Integration Methods**:
- STC-Cacher attaches to any HuggingFace pre-LN CLIP/SigLIP vision tower with a one-line monkey patch
- STC-Pruner is an explicit call that performs token compression before LLM pre-filling
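
To make the "monkey patch" integration style concrete, the sketch below wraps the `forward` method of an existing object at runtime so that callers need no other changes. Everything in it is hypothetical: `DummyVisionTower` stands in for a real vision tower, `attach_frame_cache` is not an STC or HuggingFace function, and the cache only reuses results for an identical frame, which is far simpler than the selective recomputation STC-Cacher performs.

```python
import types
from typing import Dict, List, Sequence


class DummyVisionTower:
    """Stand-in for a vision tower; not a HuggingFace class."""

    def forward(self, frame: Sequence[float]) -> List[float]:
        return [float(v) * 2.0 for v in frame]


def attach_frame_cache(tower: DummyVisionTower) -> Dict[str, object]:
    """Hypothetical helper: replace `tower.forward` with a caching wrapper."""
    original_forward = tower.forward  # bound method of the unpatched tower
    state: Dict[str, object] = {"last_frame": None, "last_tokens": None, "hits": 0}

    def cached_forward(self, frame: Sequence[float]) -> List[float]:
        if state["last_frame"] == list(frame):
            # Identical input: reuse the cached result instead of re-encoding.
            state["hits"] += 1
            return list(state["last_tokens"])
        tokens = original_forward(frame)
        state["last_frame"] = list(frame)
        state["last_tokens"] = list(tokens)
        return tokens

    # The patch itself: rebind `forward` on this instance only.
    tower.forward = types.MethodType(cached_forward, tower)
    return state


tower = DummyVisionTower()
stats = attach_frame_cache(tower)  # the single line added at the call site
first = tower.forward([1.0, 2.0])
second = tower.forward([1.0, 2.0])  # served from the cache
```

The appeal of this pattern is that the host framework keeps calling `tower.forward(...)` as before, whereas a pruning step has to be inserted explicitly between encoding and LLM pre-filling.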

---

## Main Experimental Results (ReKV Framework)

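On the OVO-Bench and StreamingBench benchmarks, STC stays closer to the ReKV baseline in accuracy than the other compression methods on every metric, while reducing both ViT encoding and LLM pre-filling latency: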

| Method | OVO Real-Time | OVO Retrospective | OVO Prospective | StreamingBench | ViT Encoding Latency | LLM Pre-filling Latency |
|------|---------|---------|---------|----------------|-------------|---------------|
| ReKV Baseline | 64.4 | 64.6 | 52.6 | 69.1 | 103.7s | 482.4s |
| + ToMe | 53.1 | 60.7 | 46.4 | 59.4 | 70.5s (↓32%) | 257.8s (↓46.6%) |
| + VisionZip | 53.8 | 58.4 | 47.5 | 60.4 | 103.7s | 258.3s (↓46.5%) |
| + VidCom² | 60.4 | 59.0 | 50.4 | 63.6 | 103.7s | 259.1s (↓46.3%) |
| **+ STC** | **62.5** | **63.3** | **52.0** | **65.2** | **78.3s (↓24.5%)** | **263.7s (↓45.3%)** |

**Key Findings**:
- STC cuts ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% while retaining up to 99% of the baseline accuracy (52.0 vs. 52.6 on OVO Prospective)
- Compared to VidCom², it scores 1.6 points higher on both OVO Prospective and StreamingBench
- Compared to ToMe, it scores 5.6 points higher on OVO Prospective and 5.8 points higher on StreamingBench
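
The percentages quoted in this section can be re-derived from the table. The short script below does that arithmetic; the numbers are copied from the ReKV baseline and STC rows, while the function and variable names are illustrative.

```python
from typing import Dict

# Values copied from the results table (accuracy in points, latency in seconds).
BASELINE: Dict[str, float] = {
    "ovo_real_time": 64.4, "ovo_retrospective": 64.6, "ovo_prospective": 52.6,
    "streamingbench": 69.1, "vit_latency_s": 103.7, "prefill_latency_s": 482.4,
}
STC: Dict[str, float] = {
    "ovo_real_time": 62.5, "ovo_retrospective": 63.3, "ovo_prospective": 52.0,
    "streamingbench": 65.2, "vit_latency_s": 78.3, "prefill_latency_s": 263.7,
}


def reduction_pct(before: float, after: float) -> float:
    """Relative latency reduction, in percent."""
    return 100.0 * (before - after) / before


def retention_pct(baseline: float, compressed: float) -> float:
    """Share of the baseline accuracy that is retained, in percent."""
    return 100.0 * compressed / baseline


latency_cut = {
    key: round(reduction_pct(BASELINE[key], STC[key]), 1)
    for key in ("vit_latency_s", "prefill_latency_s")
}
accuracy_kept = {
    key: round(retention_pct(BASELINE[key], STC[key]), 1)
    for key in ("ovo_real_time", "ovo_retrospective", "ovo_prospective", "streamingbench")
}

print(latency_cut)    # ViT: 24.5, pre-filling: 45.3
print(accuracy_kept)  # 97.0, 98.0, 98.9, 94.4
```

The "up to 99%" figure corresponds to the best case, OVO Prospective (98.9%); retention on the other three metrics ranges from 94.4% to 98.0%.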
