CosmicFish-HRM: A Hierarchical Recursive Language Model with Dynamically Adjustable Inference Depth

Introduces the architectural design of CosmicFish-HRM, including how its Hierarchical Recursive Module (HRM) dynamically allocates computing resources during inference to achieve an efficient and adaptive reasoning process.

Adaptive computing, dynamic inference, hierarchical recursion, language model, Transformer, edge deployment, computational efficiency, HRM

Published 2026-05-30 00:10 | Recent activity 2026-05-30 00:18 | Estimated read 6 min

Section 01

CosmicFish-HRM: Guide to the Hierarchical Recursive Language Model with Dynamically Adjustable Inference Depth

CosmicFish-HRM is a language model developed by the Mistyoz AI team in Hyderabad, India. Its core innovation is the Hierarchical Recursive Module (HRM), which dynamically allocates compute during inference. This addresses the waste caused by the "one-size-fits-all" approach of traditional large language models, which spend the same compute on every input, and makes the model a candidate for edge deployment. The code is open-sourced under the Apache-2.0 license and includes complete training and inference workflows.

Section 02

Background: Challenges in LLM Inference Efficiency and the Direction of Adaptive Computing

Most current large language models spend the same amount of compute on every input regardless of its complexity, which wastes resources on easy inputs and limits deployment in resource-constrained environments. Adaptive computing aims to let a model decide how many inference steps an input actually needs, and CosmicFish-HRM is a concrete implementation of this research direction.

Section 03

Project Overview: Core Components and Open-Source Information

Development team: Mistyoz AI (Hyderabad, India)
Core component: Hierarchical Recursive Module (HRM), enabling dynamic adjustment of inference depth
Open-source license: Apache-2.0
Source platform: GitHub, release date May 29, 2026
Code coverage: Complete workflows for data preparation, training, fine-tuning, and quantization

Section 04

Core Architecture: Dynamic Inference Mechanism and Technical Details

Overall Structure

Input Transformer block → HRM core → Output Transformer block → Language model head

HRM Module

H-level: Macro semantic understanding and cross-token relationship modeling
L-level: Fine-grained feature extraction and local pattern recognition

Dynamic Stopping Mechanism

After each reasoning step, a halt/continue Q-head evaluates the hidden state and decides whether to continue reasoning or emit the output. Simple tokens need only 1-2 steps, while complex tokens can take up to 16 steps.
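The stopping rule above can be sketched as a small loop. This is a minimal illustration in plain Python, not the project's code: `run_hrm`, `toy_step`, and `toy_q_head` are hypothetical names, the comparison `q_halt > q_continue` is an assumed decision rule, and only the 16-step cap comes from the text.

```python
import math
from typing import Callable, List, Tuple

MAX_STEPS = 16  # upper bound on reasoning steps, as stated in the text

Vector = List[float]


def run_hrm(
    hidden: Vector,
    step_fn: Callable[[Vector], Vector],
    q_head: Callable[[Vector], Tuple[float, float]],
    max_steps: int = MAX_STEPS,
) -> Tuple[Vector, int]:
    """Apply step_fn until the Q-head prefers halting or the budget runs out.

    q_head returns (q_halt, q_continue). Halting when q_halt > q_continue
    is an assumption made for this sketch.
    """
    steps = 0
    while steps < max_steps:
        hidden = step_fn(hidden)
        steps += 1
        q_halt, q_continue = q_head(hidden)
        if q_halt > q_continue:
            break
    return hidden, steps


# Toy stand-ins (hypothetical): each step halves the state, and the
# Q-head votes to halt once the state norm drops below a threshold.
def toy_step(h: Vector) -> Vector:
    return [0.5 * x for x in h]


def toy_q_head(h: Vector) -> Tuple[float, float]:
    norm = math.sqrt(sum(x * x for x in h))
    return (1.0, 0.0) if norm < 0.3 else (0.0, 1.0)


easy_state, easy_steps = run_hrm([0.4, 0.0], toy_step, toy_q_head)
hard_state, hard_steps = run_hrm([1e6, 0.0], toy_step, toy_q_head)
print(easy_steps, hard_steps)
```

With these toy functions the "easy" input halts after a single step, while the "hard" input never satisfies the halting test and is cut off at the 16-step cap.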

Technical Configuration

Vocabulary size: 50304
Embedding dimension: 448
Context length: 512
Input/output Transformer layers: 6 each
HRM H-level/L-level layers: 4 each
Attention: grouped-query attention (GQA) heads
Positional encoding: RoPE
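For reference, the listed values can be collected in one configuration object. This is a sketch only: the class name `HRMConfig` and every field name are hypothetical, "6 each" and "4 each" are read as 6 input plus 6 output layers and 4 H-level plus 4 L-level layers, and the number of GQA heads is left out because the text does not give it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HRMConfig:
    """Values from the article; field names are illustrative assumptions."""

    vocab_size: int = 50304
    n_embd: int = 448          # embedding dimension
    block_size: int = 512      # context length in tokens
    n_input_layers: int = 6    # input Transformer block
    n_output_layers: int = 6   # output Transformer block
    n_h_layers: int = 4        # HRM H-level layers
    n_l_layers: int = 4        # HRM L-level layers
    max_hrm_steps: int = 16    # cap on dynamic reasoning steps

    def total_layers(self) -> int:
        """Count distinct layers, ignoring how often HRM layers are reused."""
        return (
            self.n_input_layers
            + self.n_output_layers
            + self.n_h_layers
            + self.n_l_layers
        )


config = HRMConfig()
print(config.total_layers())
```

Because the HRM layers are applied repeatedly, the effective depth at inference time can exceed this static layer count.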

Section 05

Training Workflow: Data Composition and Multi-Stage Strategy

Data Composition

Approximately 10 billion tokens, sourced from FineWeb (3B), Wikipedia (3B), OpenWebText (1B), C4 (1B), CodeParrot (1B), OpenWebMath (500M), and ArXiv (500M).
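The per-source counts above add up to the stated total, and they translate directly into sampling proportions. The short sketch below only does that arithmetic; whether the project actually samples in proportion to these counts is an assumption.

```python
# Token counts per source, as listed in the article.
TOKENS_BY_SOURCE = {
    "FineWeb": 3_000_000_000,
    "Wikipedia": 3_000_000_000,
    "OpenWebText": 1_000_000_000,
    "C4": 1_000_000_000,
    "CodeParrot": 1_000_000_000,
    "OpenWebMath": 500_000_000,
    "ArXiv": 500_000_000,
}

total_tokens = sum(TOKENS_BY_SOURCE.values())

# Share of the corpus contributed by each source.
mixture_weights = {
    name: count / total_tokens for name, count in TOKENS_BY_SOURCE.items()
}

for name, weight in mixture_weights.items():
    print(f"{name:12s} {weight:.0%}")
```

FineWeb and Wikipedia together account for 60% of the corpus, with code and math sources (CodeParrot, OpenWebMath, ArXiv) making up 20%.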

Training Stages

Pre-training: Basic language modeling on 10 billion tokens
Dialogue fine-tuning: Adapt the model to a dialogue format
Instruction fine-tuning: Train on the Alpaca-GPT4 dataset
Identity calibration: Stabilize the model's persona using mixed identity and dialogue data

Optimization Strategy

Training loss includes a step penalty to encourage efficient reasoning.
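One simple way to picture such an objective is a language-modeling loss plus a term proportional to the number of reasoning steps used. The article does not give the exact form, so the linear penalty, the function name `lm_loss_with_step_penalty`, and the default `penalty_weight` below are all assumptions made for illustration.

```python
import math
from typing import Sequence


def lm_loss_with_step_penalty(
    target_probs: Sequence[float],
    steps_used: Sequence[int],
    penalty_weight: float = 0.01,  # hypothetical value
) -> float:
    """Mean negative log-likelihood plus a linear penalty on mean steps.

    target_probs[i] is the probability the model assigned to the correct
    token i; steps_used[i] is how many HRM steps were spent on that token.
    """
    if not target_probs or len(target_probs) != len(steps_used):
        raise ValueError("inputs must be non-empty and of equal length")
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    mean_steps = sum(steps_used) / len(steps_used)
    return nll + penalty_weight * mean_steps


# Same predictions, different step counts: fewer steps give a lower loss.
cheap = lm_loss_with_step_penalty([0.5, 0.5], [2, 2])
costly = lm_loss_with_step_penalty([0.5, 0.5], [16, 16])
print(cheap < costly)
```

Under this form, extra steps are worthwhile only when they reduce the prediction loss by more than `penalty_weight` per step, which is the trade-off a step penalty is meant to create.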

Section 06

Inference and Usage: Interactive Features and Parameter Adjustment

Interactive interface: chat.py supports parameter adjustment for temperature, max_tokens, top_k, etc.
Key options: --show_hrm_steps to display inference steps, --force_hrm_steps for manual override
Supported commands: /temp (adjust temperature), /tokens (max generation count), /hrm (adjust steps), etc.
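To make the in-chat commands concrete, here is a minimal sketch of how such slash commands could be parsed. It is not taken from chat.py: `ChatSettings`, `apply_command`, the default values, and the one-argument command syntax are hypothetical, and only the command names /temp, /tokens, and /hrm come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatSettings:
    """Illustrative session settings; defaults are assumptions."""

    temperature: float = 0.7
    max_tokens: int = 256
    force_hrm_steps: Optional[int] = None  # None lets the model decide


def apply_command(settings: ChatSettings, line: str) -> bool:
    """Apply a '/name value' command; return True if it was handled."""
    parts = line.strip().split()
    if len(parts) != 2 or not parts[0].startswith("/"):
        return False
    name, value = parts
    try:
        if name == "/temp":
            settings.temperature = float(value)
        elif name == "/tokens":
            settings.max_tokens = int(value)
        elif name == "/hrm":
            settings.force_hrm_steps = int(value)
        else:
            return False
    except ValueError:
        # Malformed number: leave the settings unchanged.
        return False
    return True


settings = ChatSettings()
apply_command(settings, "/temp 0.9")
apply_command(settings, "/hrm 4")
print(settings)
```

Lines that are not recognized commands return False, so a chat loop built on this sketch could pass them to the model as ordinary user input.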

Section 07

Practical Significance, Limitations, and Summary Recommendations

Significance and Advantages

Explores allocating compute dynamically at inference time, complementing optimizations made at the pre-training stage
Edge deployment-friendly, with predictable response speed and high resource utilization

Limitations

The context length is only 512 tokens, and the model is relatively small (around 400M parameters)

Summary Recommendations

The model achieves adaptive reasoning through HRM and offers a useful reference for building efficient language models. Developers researching adaptive computing are encouraged to study the project's open-source code.
