# How Post-Training Shapes Biological Reasoning Models: Differential Impacts of Training Phases on Generalization Ability

> By constructing and evaluating over 100 biological reasoning models, the study reveals the differential impacts of post-training phases on generalization ability: continuous pre-training aligns with biological language; supervised fine-tuning improves in-domain performance but causes out-of-domain performance to first rise then fall; reinforcement learning restores generalization ability. The study shows that biological reasoning performance does not increase monotonically with the amount of supervision.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-15T10:19:49.000Z
- Last activity: 2026-06-16T03:03:58.083Z
- Popularity: 134.3
- Keywords: biological reasoning models, post-training, continuous pre-training, supervised fine-tuning, reinforcement learning, generalization ability, over-specialization, ID-OOD trade-off
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-16517v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-16517v1

---

## Introduction: How Post-Training Shapes Biological Reasoning Models, Core Findings and Significance

### Research Theme
Differential impacts of post-training phases on the generalization ability of biological reasoning models

### Core Conclusions
By constructing and evaluating over 100 biological reasoning models, the study reveals:
- Continuous Pre-training (CPT) aligns with biological language, improving both in-domain (ID) and out-of-domain (OOD) performance;
- Supervised Fine-tuning (SFT) improves in-domain performance, but out-of-domain performance first rises and then falls (over-specialization);
- Reinforcement Learning (RL) restores generalization ability;
- Biological reasoning performance does not increase monotonically with the amount of supervision.

### Source Information
- Original Author/Team: Bioinformatics and AI Research Team
- Source Platform: arXiv
- Publication Date: 2026-06-15
- Original Link: http://arxiv.org/abs/2606.16517v1

## Research Background: Post-Training Dilemmas and Key Questions in Biological AI

### Transformation of Biological AI
Biological science is undergoing an AI-driven revolution: from protein structure prediction to disease diagnosis, AI models are reshaping how research is done.

### Typical Architectures
Current biological reasoning model architectures:
1. Foundation Language Models (general language understanding)
2. Biological Foundation Models (pre-trained encoders for biological sequences)
3. Multimodal Fusion (combining text and biological sequences)

### Post-Training Process
Standard three phases:
- Continuous Pre-training (CPT): Further pre-training on biological text to familiarize the model with domain terminology;
- Supervised Fine-tuning (SFT): Training on annotated task data;
- Reinforcement Learning (RL): Feedback-based optimization of model behavior.

### Key Questions
- How does each phase affect reasoning and generalization performance?
- Is adding more training phases always better?
- How should a limited training budget be allocated across phases?

## Research Methods: Systematic Experimental Design with 100+ Models

### Experiment Coverage
- **Model Scale & Architecture**: Different general language models (Llama, Mistral), biological encoders, fusion strategies;
- **Training Phase Variants**: CPT (data volume, learning rate, duration), SFT (task combinations, annotation volume, epochs), RL (reward functions, steps);
- **Evaluation Dimensions**: In-domain (ID) and out-of-domain (OOD) performance across three fields: genomics, transcriptomics, and proteomics.
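
To illustrate how a sweep over these dimensions can reach 100+ models, here is a minimal sketch that enumerates a grid of configurations. Every axis value below (fusion strategy names, token counts, epoch and step counts) is a hypothetical placeholder chosen for illustration; the source does not list the study's actual settings.

```python
from itertools import product

# Hypothetical sweep axes; values are placeholders, not the study's settings.
base_models = ["llama", "mistral"]
fusion_strategies = ["text_only", "encoder_fusion"]
cpt_tokens = [0, 1_000_000_000, 5_000_000_000]  # 0 means CPT is skipped
sft_epochs = [1, 3, 10]
rl_steps = [0, 500, 2000]                       # 0 means RL is skipped

# One dictionary per model variant to train and evaluate.
configs = [
    {"base": b, "fusion": f, "cpt_tokens": c, "sft_epochs": s, "rl_steps": r}
    for b, f, c, s, r in product(
        base_models, fusion_strategies, cpt_tokens, sft_epochs, rl_steps
    )
]

print(len(configs))  # 2 * 2 * 3 * 3 * 3 = 108 variants
```

Even a small number of values per axis multiplies quickly, which is why a full factorial design of this kind is usually pruned in practice.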

### Research Hypotheses
1. Each phase contributes differently to final performance;
2. Post-training affects both task performance and generalization ability;
3. Under a fixed budget, resource allocation across phases needs to be optimized.

## Core Findings: Differential Impacts of Post-Training Phases

### Role of CPT
- **Alignment with Biological Language**: Familiarizes the model with domain terminology and establishes links between text and biological entities;
- **Performance Impact**: Both ID and OOD performance improve with diminishing marginal returns, laying a solid foundation.

### SFT's Double-Edged Sword
- **In-domain**: Performance keeps improving as the model specializes on the task;
- **Out-of-domain**: First rises then falls (early transfer of general reasoning, later over-specialization);
- **Mechanism of Over-specialization**: Over-adaptation to the training distribution leads to loss of generalization.
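
The rise-then-fall pattern suggests monitoring an OOD validation score across SFT checkpoints and keeping the checkpoint at its peak. The following is a minimal sketch of that idea, not the study's procedure: the function name `find_ood_peak`, the `patience` parameter, and the score values are all assumptions made up for illustration.

```python
def find_ood_peak(ood_scores, patience=2):
    """Return the index of the SFT checkpoint with the best OOD score.

    Scans scores in checkpoint order and stops early once the score has
    failed to improve for `patience` consecutive checkpoints.
    """
    best_idx, best_score, stale = 0, float("-inf"), 0
    for idx, score in enumerate(ood_scores):
        if score > best_score:
            best_idx, best_score, stale = idx, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # OOD score has stopped improving: likely over-specialization
    return best_idx


# Made-up curves mimicking the reported shape: ID keeps rising, OOD peaks early.
id_scores = [0.52, 0.61, 0.68, 0.73, 0.77, 0.80]
ood_scores = [0.41, 0.47, 0.50, 0.48, 0.45, 0.42]

best = find_ood_peak(ood_scores)
print(best, id_scores[best], ood_scores[best])
```

Selecting on the OOD curve rather than the ID curve is the point of the sketch: the ID score alone would favor the last checkpoint.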

### RL's Generalization Restoration
- **Key Effect**: Improves OOD performance of strong SFT models;
- **Mechanism**: Reward-aligned optimization corrects biases, lets the model explore the solution space, and provides fine-grained feedback;
- **Applicable Conditions**: Requires a strong SFT foundation, high-quality rewards, and appropriate training strategies.

## Optimal Strategy: Recommendations for Training Phase Allocation Under Budget Constraints

### Budget Trade-off Strategies
- **Short SFT**: Stop before OOD performance declines to avoid over-specialization;
- **Large RL Allocation**: Fix over-specialization and improve generalization;
- **Asymmetric Adaptation**: High learning rate for CPT, medium for SFT, low for RL.

### Optimal Configuration Example
| Phase | Budget Ratio | Key Parameters |
|------|----------|----------|
| CPT | 20% | High learning rate, extensive biological text |
| SFT | 30% | Medium learning rate, stop before OOD performance declines |
| RL | 50% | Low learning rate, aligned reward function |
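
As a rough sketch of how the shares in this table could be turned into a concrete plan, the snippet below splits a total compute budget across the three phases. The budget shares come from the table; the `PhasePlan` class, the `allocate` helper, the GPU-hour unit, and the specific learning-rate values are assumptions added for illustration (the source only says high, medium, and low).

```python
from dataclasses import dataclass


@dataclass
class PhasePlan:
    name: str
    budget_share: float   # fraction of the total compute budget (from the table)
    learning_rate: float  # assumed value; the source only gives high/medium/low


PLAN = [
    PhasePlan("CPT", 0.20, 1e-4),  # high learning rate
    PhasePlan("SFT", 0.30, 2e-5),  # medium learning rate
    PhasePlan("RL", 0.50, 1e-6),   # low learning rate
]


def allocate(total_gpu_hours, plan=PLAN):
    """Split a total budget across phases according to their shares."""
    total_share = sum(p.budget_share for p in plan)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("budget shares must sum to 1")
    return {p.name: total_gpu_hours * p.budget_share for p in plan}


hours = allocate(1000)
for phase in PLAN:
    print(f"{phase.name}: {hours[phase.name]:.0f} GPU-hours, lr={phase.learning_rate}")
```

Keeping shares and learning rates in one structure makes it easy to check that the shares sum to one and that the rates decrease from CPT to RL.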

## Biological Significance and Insights: Re-thinking Training Strategies

### Reflections on Training Strategies
- More SFT is not always better; excessive SFT harms generalization;
- RL's value is underestimated: its role in restoring generalization is crucial;
- Phases are interdependent, not independent.

### Evaluation Criteria
- Evaluations should report both ID and OOD performance to balance task-specific ability against generalization;
- Biological applications often face distribution shifts, so OOD performance is critical.
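
A minimal sketch of reporting both numbers side by side follows. It assumes exact-match accuracy as the metric and uses made-up predictions and labels; the source does not state which metric the study uses, and the helper names `accuracy` and `id_ood_report` are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty split")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def id_ood_report(id_preds, id_labels, ood_preds, ood_labels):
    """Report ID accuracy, OOD accuracy, and the gap between them."""
    id_acc = accuracy(id_preds, id_labels)
    ood_acc = accuracy(ood_preds, ood_labels)
    return {"id_acc": id_acc, "ood_acc": ood_acc, "gap": id_acc - ood_acc}


# Made-up example: 3 of 4 correct in-domain, 1 of 4 correct out-of-domain.
report = id_ood_report(
    ["A", "B", "C", "D"], ["A", "B", "C", "A"],
    ["A", "B", "A", "A"], ["A", "C", "C", "D"],
)
print(report)
```

A widening gap across checkpoints, with ID accuracy still rising, is one simple signal of the over-specialization described earlier.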

### Cross-domain Transfer
The findings may apply to AI models in chemistry, materials science, and medicine.

## Limitations and Future Research Directions

### Current Limitations
1. Limited task scope (does not cover all biological tasks);
2. Relatively controlled data scale; effects at much larger scales still need verification;
3. RL reward design relies on manual work, which makes automation challenging.

### Future Directions
1. Dynamic training strategies (auto-detect over-specialization and adjust);
2. Impact of multi-task learning on generalization;
3. Theoretical models for post-training phase impacts;
4. Cross-domain validation of universality.