# FusionCIM: Fusion-Driven In-Memory Computing Architecture Accelerates Large Model Inference

> FusionCIM achieves a 3.86x energy efficiency improvement and a system-level energy efficiency of 29.4 TOPS/W on LLaMA-3 through three key innovations: hybrid CIM pipeline, QO stationary dataflow, and pattern-aware online softmax.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-28T07:27:58.000Z
- Last activity: 2026-04-29T03:01:43.299Z
- Popularity: 129.4
- Keywords: compute-in-memory, CIM, large model inference, operator fusion, attention mechanism, LLaMA-3, AI accelerator
- Page link: https://www.zingnex.cn/en/forum/thread/fusioncim
- Canonical: https://www.zingnex.cn/forum/thread/fusioncim
- Markdown source: floors_fallback

---

## Introduction

FusionCIM is a fusion-driven in-memory computing (CIM) architecture. To address the challenges of applying CIM to large model inference, it proposes three key innovations: hybrid CIM pipeline, QO stationary dataflow, and pattern-aware online softmax. On LLaMA-3, it achieves a 3.86x energy efficiency improvement and a system-level energy efficiency of 29.4 TOPS/W, providing a reference for AI accelerator design.

## Opportunities and Challenges of In-Memory Computing

In-memory computing (CIM) is a key technology for breaking through the memory-wall bottleneck of the von Neumann architecture: its core idea is to embed computation inside memory arrays so as to reduce data movement. Applying it to LLM inference, however, faces three major challenges:
- Complexity of operator fusion: the attention mechanism combines multiple matrix operations
- Difficulty of dataflow optimization: the dynamic nature of the KV cache makes static dataflow optimization hard
- Overhead of nonlinear operations: operations such as softmax are inefficient in the analog domain

FusionCIM proposes a systematic solution to these challenges.

## Core Innovation 1: Hybrid CIM Pipeline Architecture

FusionCIM adopts a hybrid CIM paradigm based on the characteristics of different matrix operations in the attention mechanism:
- **QK^T computation → Inner Product CIM (IP-CIM)**: uses analog current accumulation to achieve efficient dot products, suitable for parallel computing needs
- **PV aggregation → Outer Product CIM (OP-CIM)**: supports outer-product broadcasting through row-column cross current summation, reducing intermediate storage

Intelligently scheduling these two modes enables deep fusion of the attention matrix multiplications and eliminates the intermediate data movement between them.
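
As a behavioral illustration of this pairing (a minimal Python sketch of the arithmetic only, not the mixed-signal hardware; function names and shapes are hypothetical), QK^T maps naturally to per-row dot products, while PV can be accumulated as a stream of outer products so that each V row is consumed once and no separate intermediate product matrix is stored:

```python
def ip_cim_scores(Q, K):
    """IP-CIM model: QK^T as inner products, score[i][j] = dot(Q[i], K[j])."""
    return [[sum(q * k for q, k in zip(qrow, krow)) for krow in K] for qrow in Q]

def op_cim_aggregate(P, V):
    """OP-CIM model: PV as accumulated outer products, O += P[:, j] x V[j, :].
    Each V row is broadcast once; only the output accumulator O is kept."""
    n, d = len(P), len(V[0])
    O = [[0.0] * d for _ in range(n)]
    for j, vrow in enumerate(V):        # stream one K/V position at a time
        for i in range(n):
            for k in range(d):
                O[i][k] += P[i][j] * vrow[k]
    return O

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[2.0, 0.0], [0.0, 3.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
S = ip_cim_scores(Q, K)      # [[2.0, 0.0], [0.0, 3.0]]
O = op_cim_aggregate(S, V)   # [[2.0, 4.0], [9.0, 12.0]], same as the matmul S @ V
```

The point of the outer-product form is that the score column for position j can be consumed as soon as it is produced, which is what makes the deep fusion of the two matrix multiplications possible.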

## Core Innovation 2: QO Stationary Dataflow

Traditional attention requires repeated loading of KV cache, leading to high bandwidth pressure. FusionCIM proposes the QO-stationary strategy:
- **Core Idea**: In transposed fusion scenarios, keep Q and O stationary while flowing K and V
- **Optimization Points**: Eliminate repeated KV loading and K matrix buffer access, improving on-chip data reuse

This nearly doubles the bandwidth utilization of on-chip storage, which is an important source of energy efficiency advantages.
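
The loop structure below is a minimal software sketch of what a QO-stationary schedule computes (the function name and the online-rescaling details are illustrative assumptions, not taken from the paper): each query row and its output accumulator stay resident, while the K and V rows stream past exactly once, with a standard online-softmax rescale so no full score row needs to be stored.

```python
import math

def qo_stationary_attention(Q, K, V):
    """Sketch: Q rows and output accumulators are stationary; K/V stream once."""
    out = []
    for q in Q:                          # Q row held stationary "on-chip"
        m, l = float("-inf"), 0.0        # running max and running normalizer
        acc = [0.0] * len(V[0])          # unnormalized output, also stationary
        for krow, vrow in zip(K, V):     # each K/V position loaded exactly once
            s = sum(a * b for a, b in zip(q, krow))
            m_new = max(m, s)
            scale = math.exp(m - m_new)  # rescale old partial sums (exp(-inf) = 0)
            p = math.exp(s - m_new)
            l = l * scale + p
            acc = [a * scale + p * v for a, v in zip(acc, vrow)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

In a hardware mapping, the inner loop body is what the streamed K/V tiles drive, and the fact that `acc`, `m`, and `l` never leave the loop is the source of the eliminated KV reloads and K-buffer traffic.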

## Core Innovation 3: Pattern-Aware Online Softmax

Softmax is a bottleneck for CIM efficiency. FusionCIM optimizes using the distribution law of attention scores:
- **Observation**: Attention scores are highly skewed, with only a few positions receiving very high scores
- **Strategy**: Dynamically adjust the precision of exponential computation, approximate computation in low-score regions, and online rescaling to avoid full normalization

This reduces the overhead of nonlinear fusion by more than 60% while maintaining model accuracy.
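
A minimal sketch of the idea, assuming a hypothetical `skip_margin` threshold (the paper's actual precision-adjustment policy is not specified here): positions whose score falls far below the running maximum contribute exponentials indistinguishable from zero, so their `exp` evaluations are skipped outright while the running normalizer is rescaled online instead of via a full second normalization pass.

```python
import math

def pattern_aware_softmax(scores, skip_margin=10.0):
    """Online softmax that approximates low-score positions as zero.
    skip_margin is an assumed tuning knob: exp(-10) ~ 4.5e-5."""
    m, l = float("-inf"), 0.0
    # Pass 1: running max and rescaled normalizer (online softmax).
    for s in scores:
        if s > m:
            l *= math.exp(m - s)          # rescale normalizer to the new max
            m = s
        if m - s <= skip_margin:
            l += math.exp(s - m)
    # Pass 2: emit probabilities; low-score positions skip exp() entirely.
    return [math.exp(s - m) / l if m - s <= skip_margin else 0.0
            for s in scores]
```

Because the skipped terms are bounded by `exp(-skip_margin)`, the approximation error per position is controllable, which is why accuracy can be preserved while most exponential evaluations in the low-score region are avoided.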

## Experimental Validation: Performance on LLaMA-3

Evaluation on LLaMA-3 shows:
- **Energy Efficiency**: a 3.86x improvement (energy consumption reduced to about 26% of the baseline), with a system-level energy efficiency of 29.4 TOPS/W, a leading figure
- **Speed**: a 1.98x acceleration
- **Gain Breakdown**: the hybrid CIM pipeline contributes 45% of the improvement, the QO-stationary dataflow 35%, and the softmax optimization 20%

The performance improvement comes from operator fusion and efficient dataflow scheduling.

## Technical Insights and Future Directions

**Technical Insights**:
1. Heterogeneous CIM combinations are superior to single architectures
2. Dataflow design is as important as computing units
3. Algorithm-hardware co-design has great potential

**Future Directions**:
- Extend to multimodal models
- Explore more aggressive approximate computing
- Combine advanced packaging to improve integration

FusionCIM represents important progress for CIM in the field of LLM inference and provides a reference for the next generation of AI accelerators.
