Conan: A Hybrid Self-Improvement Training Framework for Human-Machine Collaborative Reasoning Models

Conan is a prototype project for training reasoning models. It runs as an automated closed loop by default and brings in human decisions only at key nodes, combining hybrid training strategies for model self-improvement with targeted human input to raise training quality.

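Tags: Conan, reasoning models, hybrid training, human-machine collaboration, automatic training, reinforcement learning, SFT, DPO, model self-improvement, training framework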
Published 2026-04-02 14:36 | Recent activity 2026-04-02 14:55 | Estimated read 7 min

Section 01

Conan: Guide to the Hybrid Self-Improvement Training Framework for Human-Machine Collaborative Reasoning Models

Conan is a prototype project for training reasoning models, currently in the MVP phase. It runs as an automated closed loop by default and adds human decision-making only at key nodes. Its core goal is to build a system with clear control flow and module boundaries, achieve model self-improvement through hybrid training strategies, and balance the efficiency of automation against the quality that human judgment brings. The project supports experiment tracking and reproducibility; real components and further functions are to be integrated gradually.

Section 02

Background and Core Concepts of the Conan Project

Training Large Reasoning Models (LRMs) poses a dilemma: fully automated pipelines lack the guidance of human intuition, while complete reliance on humans is hard to scale. Conan's core concept is 'automation first, human assistance second'. Stages such as data generation and automatic evaluation run in an automated closed loop, while human expert decisions are introduced at key nodes such as reward calibration and failure mode diagnosis. The aim is to verify, at minimal cost, whether the hybrid strategy outperforms a purely automatic baseline.

Section 03

System Architecture and Core Components of Conan

Conan adopts a modular design, with core components including:

  1. Training Engine: Coordinates various modules and supports single-round/batch execution;
  2. Task Generator: A placeholder module in the MVP phase; real task generation logic will be integrated later;
  3. Auto Evaluator: Evaluates the correctness of model outputs and the rationality of reasoning;
  4. Training Pipeline: Supports switching between training strategies like SFT, RL, and DPO;
  5. Decision Routing System: Sends each sample down one of three paths, namely approve (auto-pass), review (human review), or block (block/pause).
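
The three routing outcomes can be illustrated with a minimal sketch. This is not Conan's actual code: the `Evaluation` fields, the `route` function, and both thresholds are hypothetical names and values chosen only to show how an evaluator's scores might be mapped to approve, review, or block.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"  # auto-pass
    REVIEW = "review"    # send to human review
    BLOCK = "block"      # block/pause


@dataclass
class Evaluation:
    """Hypothetical evaluator output; both scores are assumed to lie in [0, 1]."""
    correctness: float
    reasoning_quality: float


def route(ev: Evaluation, approve_at: float = 0.8, block_below: float = 0.3) -> Decision:
    """Route one sample by its weakest score (thresholds are illustrative assumptions)."""
    score = min(ev.correctness, ev.reasoning_quality)
    if score >= approve_at:
        return Decision.APPROVE
    if score < block_below:
        return Decision.BLOCK
    return Decision.REVIEW


if __name__ == "__main__":
    samples = (Evaluation(0.9, 0.85), Evaluation(0.9, 0.5), Evaluation(0.1, 0.9))
    for ev in samples:
        print(ev, "->", route(ev).value)
```

Routing on the weaker of the two scores is one possible design choice: a correct answer reached through poor reasoning still lands in front of a reviewer instead of being auto-approved.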

Section 04

Human Review Mechanism and Intelligent Trigger Strategy

Conan's human review mechanism includes:

  • Review Queue: Automatically collects samples routed to review or block; experts record their conclusions back into the queue after reviewing them;
  • Metric Analysis: Tracks the proportions of approve/review/block decisions to show model performance trends and how human intervention is distributed;
  • Intelligent Trigger: Uses metrics to automatically recommend points for human intervention (for example, continuous failures or reward drift);
  • Strategy Switching Recommendations: Recommends switching among strategies such as SFT (correction), RL (fine-grained optimization), and DPO (preference alignment) based on metric changes.
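
The metric analysis and trigger ideas above can be sketched roughly as follows. The function names, the failure-streak length, the window size, and the drift tolerance are all assumptions made for illustration; the article does not specify how Conan computes its triggers.

```python
from collections import Counter
from typing import Optional


def decision_shares(decisions: list[str]) -> dict[str, float]:
    """Proportion of approve/review/block among routed samples."""
    counts = Counter(decisions)
    total = len(decisions)
    return {k: (counts[k] / total if total else 0.0)
            for k in ("approve", "review", "block")}


def should_request_human(passed: list[bool], rewards: list[float],
                         max_fail_streak: int = 3, drift_tol: float = 0.2,
                         window: int = 5) -> Optional[str]:
    """Return a reason string when a human should step in, else None."""
    # Trigger 1 (assumed rule): the most recent `max_fail_streak` cycles all failed.
    recent = passed[-max_fail_streak:]
    if len(recent) == max_fail_streak and not any(recent):
        return "continuous failures"
    # Trigger 2 (assumed rule): the mean reward of the latest window differs
    # from the mean of the window before it by more than `drift_tol`.
    if len(rewards) >= 2 * window:
        prev = sum(rewards[-2 * window:-window]) / window
        last = sum(rewards[-window:]) / window
        if abs(last - prev) > drift_tol:
            return "reward drift"
    return None


if __name__ == "__main__":
    print(decision_shares(["approve", "approve", "review", "block"]))
    print(should_request_human([True, False, False, False], []))
```

Returning a reason string instead of a bare boolean keeps the recommendation explainable, which matters when the output is meant to prompt a human decision.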

Section 05

Technical Implementation Details of Conan

Technical details of Conan:

  • Development Environment: Python 3.10+, with pytest as the testing framework and the project managed via pyproject.toml;
  • Code Structure: src/hybrid_trainer includes modules like engine.py (training engine) and evaluation.py (evaluation);
  • MVP Status: The current focus is on control flow correctness and module boundaries; the task generator, evaluator, and similar modules are placeholder implementations;
  • Experiment Tracking: Records cycle information, evaluation metrics, human interventions, and related data, and exports them in JSONL format to support reproducibility.
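
JSONL export of experiment records might look like the sketch below. The record fields (`cycle`, `strategy`, `reward`, `decision`, `human_note`) and the helper names are hypothetical; only the use of one JSON object per line comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def append_cycle(path: Path, record: dict) -> None:
    """Append one training cycle as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def load_cycles(path: Path) -> list[dict]:
    """Read all cycle records back, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "experiment.jsonl"
        # Field names below are illustrative, not Conan's schema.
        append_cycle(log, {"cycle": 1, "strategy": "SFT", "reward": 0.42,
                           "decision": "approve", "human_note": None})
        append_cycle(log, {"cycle": 2, "strategy": "SFT", "reward": 0.18,
                           "decision": "review", "human_note": "reward looks miscalibrated"})
        for rec in load_cycles(log):
            print(rec)
```

Appending one line per cycle means a run that stops partway still leaves a readable log, and the file can be replayed line by line when comparing experiments.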

Section 06

Future Development Plan of the Conan Project

Development plan of Conan:

  • Short-term Goals: Integrate real components, configure reward strategies, and integrate training executors;
  • Mid-term Goals: Develop a graphical human decision-making interface, support custom trigger rules, and expand multi-model support;
  • Long-term Vision: Become shared infrastructure for reasoning model training and provide a complete human-machine collaborative training toolchain.

Section 07

Industry Insights and Summary of Conan

Industry insights from Conan:

  1. Human-machine collaboration is an inevitable path: With current technology, bringing in human experts at key decision points can improve training quality;
  2. Observability is crucial: Metric aggregation and experiment tracking help understand training status and support correct decisions;
  3. Modular design promotes iteration: Independent components are easy to replace and evolve quickly.

Summary: Conan is an exploratory project in reasoning model training that implements human-machine collaboration through a systematic framework. Although it is still in the MVP phase, with several components as placeholders, it offers a concrete design for testing whether human decisions at key nodes improve on purely automatic training.