Zing Forum

How Post-Training Shapes Biological Reasoning Models: Differential Impacts of Training Phases on Generalization Ability

By constructing and evaluating over 100 biological reasoning models, the study shows that post-training phases affect generalization differently: continuous pre-training (CPT) aligns the model with biological language; supervised fine-tuning (SFT) improves in-domain performance, but out-of-domain performance first rises and then falls; reinforcement learning (RL) restores generalization ability. Biological reasoning performance therefore does not increase monotonically with the amount of supervision.

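biological reasoning models, post-training, continuous pre-training, supervised fine-tuning, reinforcement learning, generalization ability, over-specialization, ID-OOD trade-off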
Published 2026-06-15 18:19 | Recent activity 2026-06-16 11:03 | Estimated read 9 min

Section 01

Introduction: How Post-Training Shapes Biological Reasoning Models, Core Findings and Significance

Research Theme

Differential impacts of post-training phases on the generalization ability of biological reasoning models

Core Conclusions

By constructing and evaluating over 100 biological reasoning models, the study reveals:

  • Continuous Pre-training (CPT) aligns the model with biological language, improving both in-domain (ID) and out-of-domain (OOD) performance;
  • Supervised Fine-tuning (SFT) improves in-domain performance, but out-of-domain performance first rises and then falls (over-specialization);
  • Reinforcement Learning (RL) restores generalization ability;
  • Biological reasoning performance does not increase monotonically with the amount of supervision.


Section 02

Research Background: Post-Training Dilemmas and Key Questions in Biological AI

Transformation of Biological AI

Biological science is undergoing an AI-driven transformation: from protein structure prediction to disease diagnosis, AI models are reshaping how research is done.

Typical Architectures

Current biological reasoning model architectures:

  1. Foundation Language Models (general language understanding)
  2. Biological Foundation Models (pre-trained encoders for biological sequences)
  3. Multimodal Fusion (combining text and biological sequences)

Post-Training Process

Standard three phases:

  • Continuous Pre-training (CPT): Further pre-training on biological text to familiarize the model with domain terminology;
  • Supervised Fine-tuning (SFT): Training on annotated task data;
  • Reinforcement Learning (RL): Optimizing model behavior from reward feedback.

Key Questions

  • How does each phase affect reasoning and generalization performance?
  • Is adding more training phases always better?
  • How should a limited budget be allocated across phases?

Section 03

Research Methods: Systematic Experimental Design with 100+ Models

Experiment Coverage

  • Model Scale & Architecture: Different general language models (Llama, Mistral), biological encoders, fusion strategies;
  • Training Phase Variants: CPT (data volume, learning rate, duration), SFT (task combinations, annotation volume, epochs), RL (reward functions, number of steps);
  • Evaluation Dimensions: In-domain (ID) and out-of-domain (OOD) performance across three fields: genomics, transcriptomics, and proteomics.

Research Hypotheses

  1. Each phase contributes differently;
  2. Post-training affects task performance and generalization ability;
  3. Resource allocation across phases needs optimization under fixed budgets.
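
The coverage described above amounts to a grid over base model, CPT, SFT, and RL settings. As a minimal sketch of how such a grid could be enumerated: only the Llama and Mistral names come from the text, while the axis values (token counts, epochs, steps) and the `build_grid` helper are illustrative assumptions, not the study's actual settings.

```python
from itertools import product

# Hypothetical axis values for illustration only; the study's real settings
# are not given in this summary.
BASE_MODELS = ["llama", "mistral"]        # model families named in the text
CPT_TOKENS = [0, 1_000_000, 10_000_000]   # assumed CPT data volumes (0 = skip CPT)
SFT_EPOCHS = [0, 1, 3, 5]                 # assumed SFT durations (0 = skip SFT)
RL_STEPS = [0, 500, 2000]                 # assumed RL step counts (0 = skip RL)


def build_grid():
    """Return one config dict per combination of the four axes."""
    return [
        {"base": b, "cpt_tokens": c, "sft_epochs": s, "rl_steps": r}
        for b, c, s, r in product(BASE_MODELS, CPT_TOKENS, SFT_EPOCHS, RL_STEPS)
    ]


grid = build_grid()
print(len(grid))  # 2 * 3 * 4 * 3 = 72 configurations
```

Each config would then be trained and scored on both ID and OOD benchmarks, so that the effect of each phase can be compared while the others are held fixed.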

Section 04

Core Findings: Differential Impacts of Post-Training Phases

Role of CPT

  • Alignment with Biological Language: Familiarizes the model with domain terminology and establishes links between text and biological entities;
  • Performance Impact: Both ID and OOD performance improve, with diminishing marginal returns, giving later phases a solid foundation.

SFT's Double-Edged Sword

  • In-domain: Improves continuously as the model specializes on the task;
  • Out-of-domain: Rises first, then falls (general reasoning transfers early; over-specialization sets in later);
  • Mechanism of Over-specialization: Over-adaptation to the training distribution erodes generalization.
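
Because OOD performance rises and then falls during SFT, one simple way to act on this pattern is to score each saved SFT checkpoint on a held-out OOD set and keep the one at the peak. The sketch below assumes such scores are available; the function name `pick_sft_checkpoint` and the example curves are hypothetical and are not data from the study.

```python
def pick_sft_checkpoint(ood_scores):
    """Return the index of the checkpoint with the highest OOD score.

    ood_scores: OOD validation scores, one per saved SFT checkpoint,
    in training order. Ties resolve to the earliest checkpoint.
    """
    if not ood_scores:
        raise ValueError("need at least one checkpoint score")
    best = 0
    for i, score in enumerate(ood_scores):
        if score > ood_scores[best]:
            best = i
    return best


# Made-up curves shaped like the pattern described above (not study data).
id_curve = [0.50, 0.60, 0.68, 0.74, 0.78]   # ID keeps improving
ood_curve = [0.40, 0.47, 0.51, 0.49, 0.44]  # OOD rises, then falls

best = pick_sft_checkpoint(ood_curve)
print(best, id_curve[best], ood_curve[best])  # 2 0.68 0.51
```

Selecting by ID score alone would pick the last checkpoint here, which is exactly the over-specialized one.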

RL's Generalization Restoration

  • Key Effect: Improves OOD performance of strong SFT models;
  • Mechanism: Reward-aligned training corrects biases, lets the model explore the solution space, and provides fine-grained feedback;
  • Applicable Conditions: Requires a strong SFT foundation, high-quality rewards, and appropriate training strategies.

Section 05

Optimal Strategy: Recommendations for Training Phase Allocation Under Budget Constraints

Budget Trade-off Strategies

  • Short SFT: Stop before OOD performance declines to avoid over-specialization;
  • Large RL Allocation: Fix over-specialization and improve generalization;
  • Asymmetric Adaptation: High learning rate for CPT, medium for SFT, low for RL.

Optimal Configuration Example

Phase   Budget Ratio   Key Parameters
CPT     20%            High learning rate, extensive biological text
SFT     30%            Medium learning rate, stop before the OOD peak
RL      50%            Low learning rate, aligned reward function
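
As a worked example of the configuration above, the following sketch splits a total budget across the three phases. The ratios come from the table; the concrete learning rates are illustrative assumptions that only preserve the high > medium > low ordering, and the `allocate` helper and the GPU-hour unit are hypothetical.

```python
# Ratios from the example configuration; learning rates are assumed values
# chosen only to respect the high > medium > low ordering.
PHASES = {
    "CPT": {"ratio": 0.20, "lr": 1e-4},
    "SFT": {"ratio": 0.30, "lr": 2e-5},
    "RL": {"ratio": 0.50, "lr": 1e-6},
}


def allocate(total_budget):
    """Split a total budget (e.g. GPU hours) across phases by ratio."""
    total_ratio = sum(p["ratio"] for p in PHASES.values())
    if abs(total_ratio - 1.0) > 1e-9:
        raise ValueError("phase ratios must sum to 1")
    return {name: total_budget * p["ratio"] for name, p in PHASES.items()}


for name, hours in allocate(1000).items():
    print(f"{name}: {hours:.0f} GPU hours at lr={PHASES[name]['lr']}")
```

With a budget of 1000 GPU hours this gives 200 for CPT, 300 for SFT, and 500 for RL.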

Section 06

Biological Significance and Insights: Re-thinking Training Strategies

Reflections on Training Strategies

  • More SFT is not always better; excessive SFT harms generalization;
  • RL's value is underestimated: its role in restoring generalization is crucial;
  • Phases are interdependent, not independent.

Evaluation Criteria

  • Evaluation should report both ID and OOD performance, so that task-specific ability is weighed against generalization;
  • Biological applications often face distribution shifts, so OOD performance is critical.
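
A minimal sketch of what reporting both numbers could look like: each model gets an ID score, an OOD score, and the gap between them, and models are ranked by OOD first. The helper names, the gap metric, and the scores are illustrative assumptions, not the study's evaluation protocol.

```python
def summarize(id_score, ood_score):
    """Report ID, OOD, and the ID-OOD gap for one model."""
    return {
        "id": id_score,
        "ood": ood_score,
        "gap": id_score - ood_score,  # a large gap hints at over-specialization
    }


def rank_models(results):
    """Sort model names by OOD score (descending), breaking ties by ID score."""
    return sorted(
        results,
        key=lambda name: (results[name]["ood"], results[name]["id"]),
        reverse=True,
    )


# Made-up scores for illustration (not results from the study).
results = {
    "short_sft": summarize(0.68, 0.51),
    "long_sft": summarize(0.78, 0.44),
    "long_sft_plus_rl": summarize(0.77, 0.55),
}

for name in rank_models(results):
    r = results[name]
    print(f"{name}: ID={r['id']:.2f} OOD={r['ood']:.2f} gap={r['gap']:.2f}")
```

Ranking by ID alone would put `long_sft` first, even though it has the weakest OOD score of the three.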

Cross-domain Transfer

The findings may apply to AI models in chemistry, materials science, and medicine.


Section 07

Limitations and Future Research Directions

Current Limitations

  1. Limited task scope (not all biological tasks are covered);
  2. Relatively controlled data scale; effects at very large scales remain to be verified;
  3. RL reward design is done manually, which makes it hard to automate.

Future Directions

  1. Dynamic training strategies (auto-detect over-specialization and adjust);
  2. Impact of multi-task learning on generalization;
  3. Theoretical models for post-training phase impacts;
  4. Cross-domain validation of how generally the findings hold.