Zing Forum

Thinking-Bert: Exploring a Hierarchical Reasoning Architecture for Small Models to Achieve "Deep Thinking"

An experimental project that tries to give a small encoder model (hidden size 256, 8 layers) "thinking" capabilities similar to those of large reasoning models. It combines a two-level iterative processing mechanism with Adaptive Computation Time (ACT) to explore the reasoning potential of lightweight models.

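Tags: Transformer, reasoning model, hierarchical architecture, adaptive computation, lightweight model, ModernBert, iterative reasoning, masked language model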
Published 2026-06-01 17:56 | Recent activity 2026-06-01 18:18 | Estimated read 7 min

Section 01

[Introduction] Thinking-Bert: Exploring a Hierarchical Reasoning Architecture for Small Models to Achieve "Deep Thinking"

This experimental project aims to give a small encoder model (hidden size 256, 8 layers) "thinking" capabilities similar to those of large reasoning models. By combining a two-level iterative processing mechanism with Adaptive Computation Time (ACT), it explores the reasoning potential of lightweight models and points to new options for resource-constrained settings such as edge devices and mobile terminals.

Section 02

Background: The Tension Between Reasoning Capability and Model Scale

From late 2024 through 2025, reasoning models such as the OpenAI o-series and DeepSeek-R1 performed well on complex tasks but relied on tens or hundreds of billions of parameters. The core question: must reasoning capability scale with model size? Thinking-Bert combines the efficient ModernBert architecture with the hierarchical reasoning mechanism of HierarchicalReasoningModel in an attempt to bring deep thinking to small models.

Section 03

Core Architecture: Two-Level Iterative Information Flow Design

The core of the model is a hierarchical iterative processing mechanism, where the 8-layer Transformer is divided into two modules:

  • Low-level processor: Handles local features using sliding-window attention (each token attends to 128 surrounding tokens);
  • High-level processor: Aggregates global information for abstract reasoning, taking the mean-aggregated representation of the low-level output as input.

Iterative loop: the low level fuses with the previous round's high-level state → T internal iterations → mean aggregation passed to the high level → the high level updates the global state → the state is broadcast back to the low level. This cycle repeats N times.
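
The iterative loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the project's code: `low_step` and `high_step` are hypothetical stand-ins for the two Transformer modules (simple averaging updates on lists of floats), the initial global state is assumed to be zero, and the loop counts `t_inner` (T) and `n_outer` (N) are free parameters, since the source does not give their values.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average the token vectors into one summary vector (the mean aggregation)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def low_step(tokens: List[Vector], high_state: Vector) -> List[Vector]:
    """Hypothetical low-level update: mix every token with the broadcast global state."""
    return [[0.5 * t + 0.5 * h for t, h in zip(tok, high_state)] for tok in tokens]


def high_step(high_state: Vector, pooled: Vector) -> Vector:
    """Hypothetical high-level update: move the global state toward the pooled summary."""
    return [0.5 * h + 0.5 * p for h, p in zip(high_state, pooled)]


def hierarchical_loop(
    tokens: List[Vector], n_outer: int, t_inner: int
) -> Tuple[List[Vector], Vector]:
    """Run N outer rounds, each with T low-level iterations and one high-level update."""
    high_state = [0.0] * len(tokens[0])  # assumed zero initial global state
    for _ in range(n_outer):
        for _ in range(t_inner):
            # The low level fuses with the high-level state from the previous round.
            tokens = low_step(tokens, high_state)
        pooled = mean_pool(tokens)                  # mean aggregation passed upward
        high_state = high_step(high_state, pooled)  # high level updates the global state
        # The next round's low_step calls receive the updated high_state (the broadcast).
    return tokens, high_state
```

Calling `hierarchical_loop([[1.0, 3.0], [3.0, 1.0]], n_outer=1, t_inner=1)` runs one full round. In a real model the two step functions would be stacks of attention layers, but the control flow would keep this nested shape.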

Section 04

Technical Highlights: Adaptive Computation and Differentiated Encoding Strategies

  1. Adaptive Computation Time (ACT): Dynamically sets the depth of thinking according to input complexity by predicting a Q-value that decides whether to stop iterating;
  2. Dual-Frequency Rotary Position Embedding (RoPE): The low-level module uses a base of 10000 (fine-grained position awareness), while the high-level (global) module uses an extended base of 160000 (broader position generalization);
  3. Curriculum Learning: Gradually increases the sequence length (64→96→128) so that hierarchical representations are learned stably.
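
The three mechanisms in this list can each be reduced to a small helper, sketched below in plain Python. Several details are assumptions rather than facts from the project: the halting rule compares a hypothetical "halt" value against a "continue" value and enforces a step budget (the source only says a Q-value decides when to stop), the head dimension of 64 is derived from a hidden size of 256 split over 4 heads, and `epochs_per_stage` is an invented knob because the source gives the length stages but not when they switch.

```python
import math
from typing import List, Sequence


def should_halt(q_halt: float, q_continue: float, step: int, max_steps: int) -> bool:
    """Assumed ACT rule: stop when halting scores higher, or when the budget is spent."""
    return step >= max_steps or q_halt > q_continue


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength in tokens of each rotary pair: 2*pi * base**(2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def curriculum_seq_len(
    epoch: int, stages: Sequence[int] = (64, 96, 128), epochs_per_stage: int = 1
) -> int:
    """Sequence length for an epoch: step through the stages, then hold the last one."""
    index = min(epoch // epochs_per_stage, len(stages) - 1)
    return stages[index]


if __name__ == "__main__":
    head_dim = 256 // 4  # assumed: hidden size 256 divided across 4 attention heads
    low = rope_wavelengths(head_dim, 10_000.0)
    high = rope_wavelengths(head_dim, 160_000.0)
    print(f"longest wavelength, base 10000:  {low[-1]:.0f} tokens")
    print(f"longest wavelength, base 160000: {high[-1]:.0f} tokens")
    print([curriculum_seq_len(epoch) for epoch in range(5)])
```

Both bases share the same shortest wavelength (2π tokens), so the larger base only stretches the slow-rotating dimensions. That is one way to read the "fine-grained" versus "broad" distinction in the list.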

Section 05

Model Configuration and Reasoning Process

Model Configuration:

Parameter       | Value | Description
----------------|-------|--------------------------------
Dimension       | 256   | Minimal hidden layer dimension
Layers          | 8     | 4 low-level + 4 high-level
Attention Heads | 4     | Multi-head configuration
Vocabulary      | 16384 | Compact BPE vocabulary
Sequence Length | 128   | Moderate context window
Batch Size      | 32    | Friendly training configuration

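
The configuration table can be captured as a small Python dataclass, which also makes a few derived quantities explicit. The class name `ThinkingBertConfig` and its field names are hypothetical (the source does not show the project's own config object); only the numeric values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThinkingBertConfig:
    """Illustrative container; values copied from the configuration table."""

    hidden_dim: int = 256
    num_layers: int = 8
    num_low_layers: int = 4
    num_high_layers: int = 4
    num_heads: int = 4
    vocab_size: int = 16384
    max_seq_len: int = 128
    batch_size: int = 32

    def __post_init__(self) -> None:
        # Consistency checks implied by the table.
        assert self.num_low_layers + self.num_high_layers == self.num_layers
        assert self.hidden_dim % self.num_heads == 0

    @property
    def head_dim(self) -> int:
        """Per-head dimension when the hidden size is split evenly across heads."""
        return self.hidden_dim // self.num_heads

    @property
    def embedding_params(self) -> int:
        """Entries in a vocab_size x hidden_dim token-embedding matrix."""
        return self.vocab_size * self.hidden_dim

    @property
    def tokens_per_batch(self) -> int:
        """Upper bound on tokens processed per training batch."""
        return self.batch_size * self.max_seq_len
```

With the table's values, each head works in 64 dimensions, a token-embedding matrix alone would hold 16384 × 256 = 4,194,304 entries, and a full batch covers at most 32 × 128 = 4096 tokens.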
Reasoning Process: Input encoding → Mask positioning → Tensor preparation → Iterative thinking → Result extraction → Decoding output.
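
The six steps of this reasoning (inference) process map onto a short mask-filling function. The sketch below is a toy under stated assumptions: `fill_mask`, the whitespace tokenizer, the `[MASK]` string, and the `think` callback are all hypothetical stand-ins, and the real project would use a BPE tokenizer and the iterative model in their place.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # assumed mask token string


def fill_mask(
    text: str,
    vocab: Dict[str, int],
    think: Callable[[List[int], int], int],
) -> str:
    """Walk the six steps with a toy tokenizer and a pluggable model callback."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    tokens = text.split()                          # 1. input encoding (toy tokenizer)
    mask_pos = tokens.index(MASK)                  # 2. mask positioning
    ids = [vocab[tok] for tok in tokens]           # 3. tensor preparation (a list of ids)
    predicted_id = think(ids, mask_pos)            # 4. iterative thinking (model call)
    tokens[mask_pos] = id_to_token[predicted_id]   # 5. result extraction
    return " ".join(tokens)                        # 6. decoding output


if __name__ == "__main__":
    toy_vocab = {"the": 0, "cat": 1, "sat": 2, MASK: 3}
    # Stub model that always predicts "sat"; a real model would score the whole vocabulary.
    print(fill_mask("the cat [MASK]", toy_vocab, lambda ids, pos: toy_vocab["sat"]))
```

Keeping the model behind a callback makes it easy to swap the stub for a function that runs the hierarchical loop and stops according to ACT.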

Section 06

Significance and Outlook: A New Paradigm for Small Model Reasoning

  1. Architectural Innovation Over Scale Stacking: Designs such as hierarchical iteration aim to give small models reasoning capabilities, which suits resource-constrained scenarios;
  2. Cognitive Science Inspiration: Draws on dual-system theory from human cognition (System 1: fast intuition; System 2: slow, deliberate thinking);
  3. Open Source Community Value: Quickly tests cutting-edge ideas, builds on the latest results from 2024-2025, and gives the community a base to iterate on.

Section 07

Limitations and Future Improvement Directions

Limitations: Little information is available on the scale or quality of the training data; the model has not been evaluated on standard reasoning benchmarks (GSM8K, HumanEval); and iterative training and ACT pose stability challenges.

Future Directions: Introduce larger pre-training data; design dedicated training objectives for reasoning tasks; use distillation to transfer knowledge from large models; verify effectiveness across multiple downstream tasks.

Section 08

Conclusion: Reasoning Is Not Exclusive to Large Models

Thinking-Bert suggests that, with careful architectural design and training strategies, lightweight models can also acquire "thinking" capabilities. The future AI ecosystem may diversify, with cloud-based large models and edge-side small models each playing to their strengths, and this project is one piece of that puzzle.