# Thinking-Bert: Exploring a Hierarchical Reasoning Architecture for Small Models to Achieve "Deep Thinking"

> An experimental project that tries to give a small encoder model (hidden size 256, 8 layers) "thinking" capabilities similar to those of large reasoning models. It combines a two-level iterative processing mechanism with Adaptive Computation Time (ACT) to explore how far a lightweight model can reason.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T09:56:43.000Z
- Last activity: 2026-06-01T10:18:42.708Z
- Popularity: 159.6
- Keywords: Transformer, reasoning models, hierarchical architecture, adaptive computation, lightweight models, ModernBert, iterative reasoning, masked language model
- Page link: https://www.zingnex.cn/en/forum/thread/thinking-bert
- Canonical: https://www.zingnex.cn/forum/thread/thinking-bert

---

## Introduction: A Hierarchical Reasoning Architecture for Small Models

This experimental project aims to give a small encoder model (hidden size 256, 8 layers) "thinking" capabilities similar to those of large reasoning models. By combining a two-level iterative processing mechanism with Adaptive Computation Time (ACT), it probes the reasoning potential of lightweight models and targets resource-constrained settings such as edge and mobile devices.

## Background: The Tension Between Reasoning Capability and Model Scale

From late 2024 through 2025, reasoning models such as the OpenAI o-series and DeepSeek-R1 performed well on complex tasks, but they rely on tens or hundreds of billions of parameters. This raises the core question: does reasoning capability have to scale with model size? Thinking-Bert combines the efficient ModernBert architecture with the hierarchical reasoning mechanism of the Hierarchical Reasoning Model (HRM) in an attempt to bring "deep thinking" to a small model.

## Core Architecture: Two-Level Iterative Information Flow Design

The core of the model is a hierarchical iterative processing mechanism. The 8-layer Transformer is split into two modules:

- **Low-level processor**: handles local features using sliding-window attention, where each token attends to a window of 128 surrounding tokens;
- **High-level processor**: aggregates global information for abstract reasoning; its input is the mean-pooled representation of the low-level output.

Each outer iteration runs as follows: the low-level processor fuses its input with the high-level state from the previous round, runs T inner iterations, and passes the mean-pooled result to the high-level processor; the high-level processor then updates the global state and broadcasts it back to the low-level processor. This loop repeats N times.
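
The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the information flow, not the project's code: the function names (`low_step`, `high_step`, `hierarchical_forward`) are hypothetical, and the `tanh` updates are simple stand-ins for the low-level and high-level Transformer blocks.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average all token vectors into one summary vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def low_step(tokens: List[Vector], high: Vector) -> List[Vector]:
    """Stand-in for the low-level blocks (assumption): fuse every token
    with the broadcast high-level state."""
    return [[math.tanh(t + h) for t, h in zip(token, high)] for token in tokens]


def high_step(high: Vector, summary: Vector) -> Vector:
    """Stand-in for the high-level blocks (assumption): update the global
    state from the mean-pooled low-level output."""
    return [math.tanh(h + s) for h, s in zip(high, summary)]


def hierarchical_forward(
    tokens: List[Vector], n_outer: int, t_inner: int
) -> Tuple[List[Vector], Vector]:
    """Run N outer rounds, each with T inner low-level iterations."""
    high = [0.0] * len(tokens[0])  # initial global state
    for _ in range(n_outer):
        for _ in range(t_inner):
            tokens = low_step(tokens, high)  # low level sees the last high state
        summary = mean_pool(tokens)          # mean aggregation to the high level
        high = high_step(high, summary)      # high level updates the global state
    return tokens, high


tokens_out, high_out = hierarchical_forward(
    [[0.1, -0.2], [0.3, 0.4]], n_outer=2, t_inner=3
)
print(len(tokens_out), len(high_out))
```

The point of the sketch is the nesting: the high-level state changes only once per outer round, while the low-level state is refined T times in between.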

## Technical Highlights: Adaptive Computation and Differentiated Encoding Strategies

1. **Adaptive Computation Time (ACT)**: adjusts the depth of "thinking" to the complexity of the input by predicting a Q-value that decides whether to halt;
2. **Dual-base Rotary Position Embedding (RoPE)**: the low-level processor uses a RoPE base of 10000 for fine-grained position awareness, while the high-level (global) processor uses a larger base of 160000 for broader position generalization;
3. **Curriculum learning**: the training sequence length increases in stages (64→96→128) so that hierarchical representations are learned stably.
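
A minimal sketch of the first two ideas follows, under stated assumptions. `rope_inv_freq` uses the standard RoPE formula `base ** (-2i / head_dim)`; the head dimension of 64 assumes 256 hidden units split over 4 heads. `think_with_act` shows one common halting rule (stop when a predicted halt value exceeds the continue value, or when a step budget runs out); the source only says that a Q-value decides when to stop, so the two-value form, the function names, and the toy `step_fn`/`q_fn` are all hypothetical.

```python
from typing import Callable, List, Tuple


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def think_with_act(
    state: float,
    step_fn: Callable[[float], float],
    q_fn: Callable[[float], Tuple[float, float]],
    max_steps: int,
) -> Tuple[float, int]:
    """Iterate step_fn until the halting head prefers to stop.

    q_fn returns (q_halt, q_continue); this two-value form is an
    assumption, not something the source specifies.
    """
    steps = 0
    while steps < max_steps:
        state = step_fn(state)
        steps += 1
        q_halt, q_continue = q_fn(state)
        if q_halt > q_continue:
            break
    return state, steps


# Low-level (local) and high-level (global) bases from the list above.
local_freqs = rope_inv_freq(64, 10000.0)
global_freqs = rope_inv_freq(64, 160000.0)
# The larger base yields a lower minimum frequency, i.e. longer wavelengths.
print(global_freqs[-1] < local_freqs[-1])

# Toy "thinking" task: halve a number until it is small enough.
final_state, used_steps = think_with_act(
    8.0,
    step_fn=lambda s: s / 2,
    q_fn=lambda s: (1 - abs(s), abs(s)),
    max_steps=10,
)
print(final_state, used_steps)
```

In the toy run, harder inputs (larger starting values) take more steps before the halt value wins, which is the behavior ACT is meant to provide; `max_steps` caps the cost when the halting head never fires.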

## Model Configuration and Inference Process

**Model Configuration**:

| Parameter | Value | Description |
|------|------|------|
| Hidden dimension | 256 | Small hidden-layer dimension |
| Layers | 8 | 4 low-level + 4 high-level |
| Attention heads | 4 | Multi-head attention |
| Vocabulary size | 16384 | Compact BPE vocabulary |
| Sequence length | 128 | Moderate context window |
| Batch size | 32 | Training-friendly configuration |

**Inference Process**: Input encoding → Mask positioning → Tensor preparation → Iterative thinking → Result extraction → Decoding output.
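
To give a feel for how small this configuration is, the values in the table can be captured in a dataclass and turned into a rough parameter estimate. The class and function names are hypothetical, and the estimate rests on assumptions the source does not confirm: a plain feed-forward block with a 4x expansion, no biases or normalization parameters, and an output layer tied to the input embedding.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ThinkingBertConfig:
    """Values taken from the configuration table; field names are illustrative."""
    hidden_size: int = 256
    num_low_layers: int = 4
    num_high_layers: int = 4
    num_heads: int = 4
    vocab_size: int = 16384
    max_seq_len: int = 128
    batch_size: int = 32

    @property
    def num_layers(self) -> int:
        return self.num_low_layers + self.num_high_layers

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads


def estimate_params(cfg: ThinkingBertConfig, mlp_ratio: int = 4) -> Dict[str, int]:
    """Back-of-the-envelope weight count (mlp_ratio is an assumption)."""
    d = cfg.hidden_size
    embedding = cfg.vocab_size * d   # token embedding table
    attention = 4 * d * d            # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d      # up and down projections
    per_layer = attention + mlp
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "total": embedding + cfg.num_layers * per_layer,
    }


cfg = ThinkingBertConfig()
print(cfg.head_dim, cfg.batch_size * cfg.max_seq_len)
print(estimate_params(cfg))
```

Under these assumptions the estimate comes to about 10.5 million weights (4,194,304 in the embedding plus 8 × 786,432 in the layers), with 64 dimensions per head and 4,096 tokens per training batch. The real count depends on details such as the feed-forward variant and the ACT head, which the source does not describe.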

## Significance and Outlook: A New Paradigm for Small Model Reasoning

1. **Architectural innovation over scale stacking**: designs such as hierarchical iteration aim to give small models reasoning capabilities, which suits resource-constrained scenarios;
2. **Inspiration from cognitive science**: the design draws on the dual-system theory of human cognition (System 1: fast intuition; System 2: slow, deliberate thinking);
3. **Value to the open-source community**: the project offers a quick testbed for cutting-edge ideas, builds on results from 2024-2025, and gives the community a base to iterate on.

## Limitations and Future Improvement Directions

**Limitations**: Little information is available on the scale and quality of the training data; the model has not been evaluated on standard benchmarks (GSM8K, HumanEval); iterative training and ACT pose stability challenges.

**Future Directions**: Use larger pre-training datasets; design training objectives dedicated to reasoning tasks; apply distillation to transfer knowledge from large models; verify effectiveness across multiple downstream tasks.

## Conclusion: Reasoning Is Not Exclusive to Large Models

Thinking-Bert suggests that, with careful architectural design and training strategies, lightweight models may also acquire "thinking" capabilities, although this still needs to be confirmed on standard benchmarks. The future AI ecosystem may well be diverse, with large cloud-based models and small edge-side models each playing to their strengths, and this project is one piece of that puzzle.
