YOCO-U: A New Transformer Architecture for Efficient Depth Expansion via Recursive Computation

YOCO-U combines the YOCO decoder architecture with recursive computation. Using a parameter-shared universal self-decoder built on shallow efficient-attention layers, it expands effective depth while keeping the KV cache constant and prefill linear, offering a new direction for scaling inference-time computation efficiently.

Tags: YOCO architecture, recursive computation, Transformer, KV cache optimization, test-time scaling, efficient inference, depth expansion

Published 2026-04-02 01:58 | Recent activity 2026-04-02 10:49 | Estimated read 7 min

Section 01

YOCO-U: Introduction to the New Transformer Architecture for Efficient Depth Expansion

Standard Transformers pay for extra inference-time computation twice: in computational overhead and in an inflated KV cache. YOCO-U addresses both by combining the YOCO decoder architecture with recursive computation. A parameter-shared universal self-decoder iterates over shallow efficient-attention layers, so effective depth grows while the KV cache stays constant and prefill stays linear.

Section 02

Challenges of Inference-Time Scaling and Background on Existing Techniques

The Rise and Challenges of Inference-Time Computation

In recent years, test-time scaling techniques, which spend extra computation at inference, have improved the reasoning capabilities of large language models. Standard Transformers make this expensive for two reasons: computational overhead, because attention is recomputed in every iteration, and KV cache inflation, because the cache grows linearly with depth.
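The linear growth of the cache with depth can be checked with simple arithmetic. The sketch below assumes a standard decoder that stores one key tensor and one value tensor per layer in 16-bit precision. The function name and the model shape are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence in a standard decoder."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, not taken from the article.
shape = dict(seq_len=32_768, num_kv_heads=8, head_dim=128)
for layers in (16, 32, 64):
    gib = kv_cache_bytes(layers, **shape) / 2**30
    print(f"{layers} layers -> {gib:.1f} GiB")
# 16 layers -> 2.0 GiB
# 32 layers -> 4.0 GiB
# 64 layers -> 8.0 GiB
```

Doubling the depth doubles the cache. Repeating a standard block, with each pass keeping its own cache, has the same effect as adding layers.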

Advantages of the YOCO Architecture

The YOCO architecture adopts a decoder-decoder structure. Its shallow efficient-attention layers produce a global KV cache that is shared rather than duplicated per layer, which keeps the cache size constant. Its prefill complexity is linear, which makes it more efficient on long sequences.
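A rough count shows why a shared cache is smaller. The sketch assumes the shallow layers use sliding-window attention with a fixed window, which is one form of efficient attention and not confirmed by the text above. Function names and sizes are hypothetical, and sizes are counted in cached token slots rather than bytes.

```python
def yoco_cache_entries(seq_len, num_self_layers, window):
    """Cached key/value vectors per sequence, counted in token slots."""
    # One global key/value cache, written once and read by every deep layer.
    shared = 2 * seq_len
    # Assumption: each shallow layer uses sliding-window attention, so its
    # own cache never holds more than `window` tokens.
    local = 2 * num_self_layers * min(window, seq_len)
    return shared + local


def per_layer_cache_entries(seq_len, num_layers):
    """Baseline: every layer keeps keys and values for every token."""
    return 2 * num_layers * seq_len


seq_len = 65_536
print(per_layer_cache_entries(seq_len, num_layers=32))               # 4194304
print(yoco_cache_entries(seq_len, num_self_layers=16, window=1024))  # 163840
```

In this count the shared design does not grow when more deep layers are added, because they all read the same global cache.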

Potential and Limitations of Recursive Computation

Recursive computation deepens representations by applying the same layers repeatedly, but on its own it incurs high computational overhead and inflates the cache. It has to be paired with efficient cache management before the two can reinforce each other.

Section 03

YOCO-U Architecture Design and Technical Details

Core Design of YOCO-U

YOCO-U combines YOCO with recursive computation. Its core is the universal self-decoder, which uses parameter sharing to run multiple iterations over the shallow efficient-attention layers. The deep standard decoder is responsible for extracting semantics, while the shallow layers recursively refine representations, and the KV cache stays constant in size.
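The idea of iterating one set of weights can be sketched in a few lines. This toy is not the YOCO-U implementation: `SharedBlock` is a hypothetical stand-in that uses a small residual map where a real block would use efficient attention, and it works on a single vector with the standard library only.

```python
import math
import random


class SharedBlock:
    """Toy stand-in for one shallow layer (a real block would use efficient attention)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        self.weight = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def num_parameters(self):
        return sum(len(row) for row in self.weight)

    def __call__(self, x):
        # Residual update: x + tanh(W x)
        return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
                for xi, row in zip(x, self.weight)]


def universal_self_decoder(block, x, num_iterations):
    """Apply the same block repeatedly; no new weights are created per pass."""
    for _ in range(num_iterations):
        x = block(x)
    return x


block = SharedBlock(dim=4)
x = [1.0, 0.0, -1.0, 0.5]
refined = universal_self_decoder(block, x, num_iterations=3)
print(block.num_parameters())  # 16, regardless of num_iterations
```

Three passes through one block give three layers of effective depth while the parameter count stays at that of a single block.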

Key Technical Details

Recursive Position Selection: Restricted to shallow layers to handle local patterns and low-level features;
Parameter Sharing: Keeps the number of parameters unchanged and learns a universal refinement strategy;
Adaptive Termination: Determines the recursive depth based on input complexity.
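The text above does not say how adaptive termination is decided. As one possible criterion, assumed here purely for illustration, the loop below stops once the relative change between successive iterations falls below a tolerance. `refine_until_stable`, `tolerance`, and the toy `step` function are all hypothetical.

```python
import math


def refine_until_stable(step, x, max_iterations=8, tolerance=1e-3):
    """Apply `step` repeatedly; stop early once the update becomes small."""
    for iteration in range(1, max_iterations + 1):
        new_x = step(x)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        norm = math.sqrt(sum(a * a for a in x)) or 1.0  # avoid dividing by zero
        x = new_x
        if change / norm < tolerance:
            return x, iteration
    return x, max_iterations


# Toy refinement step that converges toward 2.0.
step = lambda v: [0.5 * a + 1.0 for a in v]
state, depth = refine_until_stable(step, [0.0], tolerance=0.05)
print(state, depth)  # [1.9375] 5
```

Inputs whose representation settles quickly use fewer iterations, and `max_iterations` bounds the cost for inputs that never settle.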

Section 04

Experimental Verification Results of YOCO-U

General Benchmark Tests

Compared with non-recursive YOCO models of the same scale, YOCO-U shows significant improvements on multiple tasks, especially multi-step reasoning, with only a limited increase in inference latency.

Long Context Tests

Because the KV cache stays constant, YOCO-U can handle long documents of tens of thousands of tokens. The recursive mechanism captures long-distance dependencies better, leading to strong performance on document-level understanding tasks.

Scaling Behavior

As the recursive depth increases, the model's capabilities continue to improve while the computational cost grows gently, which compares favorably with the linear or superlinear growth of standard Transformers.
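One reason the cost can grow more gently is that only part of the stack is repeated. The count below is a deliberately crude sketch. It assumes every layer pass costs the same, which ignores the cheaper attention in the shallow layers, and the layer counts are hypothetical.

```python
def layer_applications(total_layers, recursive_layers, iterations):
    """Layer passes per token when only the shallow layers are repeated."""
    if not 0 <= recursive_layers <= total_layers:
        raise ValueError("recursive_layers must be between 0 and total_layers")
    deep_layers = total_layers - recursive_layers
    return recursive_layers * iterations + deep_layers


for t in (1, 2, 4):
    shallow_only = layer_applications(32, 16, t)  # repeat half the stack
    whole_stack = layer_applications(32, 32, t)   # repeat every layer
    print(t, shallow_only, whole_stack)
# 1 32 32
# 2 48 64
# 4 80 128
```

Under these assumptions, four iterations over the shallow half cost 80 layer passes, against 128 for repeating the whole stack.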

Section 05

Architectural Insights and Application Prospects of YOCO-U

Architectural Design Insights

Multi-dimensional Collaboration: Combine the complementary strengths of different techniques;
Fine-grained Resource Allocation: Give shallow and deep layers different roles;
Smart Computation: Optimize test-time scaling through architectural innovation.

Application Prospects

YOCO-U is suited to long document processing (legal analysis, medical reviews), deep reasoning (mathematical proofs, code debugging), and resource-constrained environments (edge devices, real-time systems). It can adjust the recursive depth dynamically to balance quality and speed.
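Trading quality for speed could be as simple as picking the deepest recursion that fits a latency budget. The helper below is a hypothetical sketch. It assumes latency is roughly a fixed base cost plus a constant cost per iteration, and the timings would have to be measured on the target device.

```python
def pick_recursive_depth(budget_ms, base_ms, per_iteration_ms, max_depth=8):
    """Largest depth in [1, max_depth] whose estimated latency fits the budget."""
    if per_iteration_ms <= 0:
        raise ValueError("per_iteration_ms must be positive")
    affordable = int((budget_ms - base_ms) // per_iteration_ms)
    # Always run at least one pass, and never exceed the configured maximum.
    return max(1, min(max_depth, affordable))


print(pick_recursive_depth(budget_ms=50, base_ms=20, per_iteration_ms=6))    # 5
print(pick_recursive_depth(budget_ms=10, base_ms=20, per_iteration_ms=6))    # 1
print(pick_recursive_depth(budget_ms=1000, base_ms=20, per_iteration_ms=6))  # 8
```

A real-time system would call this with a tight budget, while an offline job could pass a large one and get the maximum depth.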

Conclusion

YOCO-U is an important milestone in the evolution of Transformers. Through architectural innovation, it achieves depth expansion without sacrificing efficiency, providing a sustainable path for test-time expansion.

Section 06

Limitations of YOCO-U and Future Research Directions

Current Limitations

The adaptive recursive depth still needs optimization;
More research is needed on the adaptation of the recursive mechanism to specific tasks.

Future Directions

Explore complex structures such as hierarchical/conditional recursion;
Combine efficiency technologies like sparse attention and quantization;
Apply to multi-modal models.
