Zing Forum

AROMA: A New Framework for Predicting Gene Perturbation in Virtual Cells by Integrating Multimodal Reasoning and Reinforcement Learning

AROMA is a multimodal virtual cell modeling framework accepted at ACL 2026. It integrates textual evidence, graph topology, and protein sequences, and combines retrieval augmentation with GRPO reinforcement learning to predict gene perturbation effects accurately and interpretably.

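Tags: virtual cell modeling, gene perturbation prediction, multimodal learning, knowledge graph, reinforcement learning, GRPO, computational biology, ACL 2026, AI4Science
Published 2026-04-23 14:13 | Recent activity 2026-04-23 14:53 | Estimated read 10 min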

Section 01

[Main Floor/Introduction] AROMA: A New Framework for Predicting Gene Perturbation in Virtual Cells by Integrating Multimodal Reasoning and Reinforcement Learning

AROMA is a multimodal virtual cell modeling framework accepted at ACL 2026. It integrates textual evidence, graph topology, and protein sequences, and combines retrieval augmentation with GRPO reinforcement learning to predict gene perturbation effects accurately and to explain those predictions. It targets the high cost and long turnaround of traditional gene perturbation experiments, and it brings natural language processing and computational biology closer together.


Section 02

Research Background and Core Challenges

In biomedical research, gene perturbation experiments are a core method for understanding cell function and disease mechanisms. Traditional wet-lab experiments, however, are costly and slow, which makes it impractical to explore the vast space of gene combinations systematically. Virtual cell modeling, an active direction in computational biology, simulates how cells respond to gene perturbations, which can cut costs and speed up drug target discovery. The field faces three major challenges:

  1. Data Heterogeneity: Gene function information is scattered across multimodal sources such as text literature, knowledge graphs, and protein sequences. A single modality cannot capture the complete biological context;
  2. Lack of Interpretability: Although black-box models can predict perturbation effects, they cannot provide causal explanations understandable to biologists;
  3. Limited Generalization Ability: The gene combinations covered by training data are limited, so models struggle to generalize to unseen perturbation scenarios.

AROMA (Augmented Reasoning Over a Multimodal Architecture) is proposed to address these pain points and has been accepted by the ACL 2026 main conference.

Section 03

Technical Architecture: Data Construction and Multimodal Encoding

AROMA's technical architecture has two phases: data construction and modeling.

Data Phase: Dual Knowledge Graph Construction

Two complementary biological knowledge graphs are constructed:

  • Gene-KG: Captures functional associations, regulatory relationships, and pathway memberships between genes;
  • Path-KG: Depicts the hierarchical structure of biological signaling pathways and cross-pathway interactions.

In parallel, a large-scale virtual cell reasoning dataset, PerturbReason, is constructed to provide a foundation for evidence retrieval and reasoning.
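
A minimal sketch of how two such graphs could be held in memory and queried together. The triple format, the `KnowledgeGraph` class, the relation names (`regulates`, `member_of`, `interacts_with`), and the toy TP53 entries are all illustrative assumptions, not the actual Gene-KG or Path-KG schema.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory store of (head, relation, tail) triples."""

    def __init__(self):
        # head -> list of (relation, tail) pairs, in insertion order
        self._out = defaultdict(list)

    def add(self, head, relation, tail):
        self._out[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Return tails linked from `head`, optionally filtered by relation."""
        return [
            tail
            for rel, tail in self._out.get(head, [])
            if relation is None or rel == relation
        ]


# Hypothetical toy triples; the real graphs are far larger and may use
# different relation names.
gene_kg = KnowledgeGraph()
gene_kg.add("TP53", "regulates", "CDKN1A")
gene_kg.add("TP53", "regulates", "MDM2")
gene_kg.add("TP53", "member_of", "p53 signaling pathway")

path_kg = KnowledgeGraph()
path_kg.add("p53 signaling pathway", "interacts_with", "Cell cycle")


def pathway_context(gene):
    """Pathways a gene belongs to, plus the pathways those link to."""
    pathways = gene_kg.neighbors(gene, "member_of")
    linked = [p2 for p in pathways for p2 in path_kg.neighbors(p)]
    return pathways, linked


if __name__ == "__main__":
    print(gene_kg.neighbors("TP53", "regulates"))
    print(pathway_context("TP53"))
```

Keeping the two graphs separate and joining them at query time mirrors the complementary roles described above: one graph answers gene-level questions, the other pathway-level ones.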

Modeling Phase: Retrieval-Augmented Multimodal Encoding

When given a gene perturbation query:

  1. Retrieve Relevant Evidence: Pull textual evidence related to the query from the knowledge graphs and the literature;
  2. Graph Neural Network Encoding: Use GNN to extract topological features from Gene-KG and Path-KG, capturing the structural role of genes in biological networks;
  3. Protein Sequence Encoding: Use the ESM-2 pre-trained model to encode protein sequences, capturing functional information at the amino acid level;
  4. Cross-Modal Attention Fusion: Explicitly model the dependency between perturbed genes and target genes across different modalities through a cross-attention module.

This design is neural-symbolic: it combines symbolic reasoning over structured knowledge with neural representation learning.
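
The cross-modal fusion in step 4 can be illustrated with single-head scaled dot-product attention, where the perturbed gene's vector is the query and the target gene's per-modality vectors are the keys and values. This is only a sketch under assumptions: the 4-dimensional embeddings are invented, learned query/key/value projections are omitted, and the function and variable names are hypothetical. The post does not specify the internals of AROMA's actual module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, keys, values):
    """Attend from one query vector over a set of key/value vectors.

    Returns the fused vector and the attention weight per key.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


# Hypothetical embeddings (assumed values, for illustration only).
perturbed_gene = [1.0, 0.0, 1.0, 0.0]
modalities = ["text", "graph", "sequence"]
target_gene_views = [
    [1.0, 0.0, 1.0, 0.0],  # text evidence embedding
    [0.0, 1.0, 0.0, 1.0],  # GNN embedding from the knowledge graphs
    [1.0, 1.0, 0.0, 0.0],  # protein sequence embedding
]

# Keys and values are the same vectors here because projections are omitted.
fused, weights = cross_attention(perturbed_gene, target_gene_views, target_gene_views)

if __name__ == "__main__":
    for name, w in zip(modalities, weights):
        print(f"{name}: {w:.3f}")
```

The attention weights give a per-modality score for each query, which is one way such a module can expose which evidence source drove a prediction.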


Section 04

Technical Architecture: Training Optimization Strategy

AROMA adopts a two-stage training strategy to optimize the model:

First Stage: Multimodal Supervised Fine-Tuning (SFT)

The model is first trained with multimodal supervision on the PerturbReason dataset. This stage teaches the basic mapping from an input query to a perturbation effect prediction, so that the model acquires foundational biological knowledge and prediction ability before reinforcement learning begins.
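
Supervised fine-tuning of a language model typically minimizes the negative log-likelihood of the reference answer tokens. The sketch below shows that objective on a single toy example. Masking the prompt tokens out of the loss is a common convention assumed here, and the probabilities are invented; the post does not describe AROMA's exact loss or data format.

```python
import math


def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each reference token.
    is_response: True where the token belongs to the answer, False for prompt.
    """
    nll = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not nll:
        raise ValueError("no response tokens to supervise")
    return sum(nll) / len(nll)


# Hypothetical values: two prompt tokens followed by two answer tokens.
token_probs = [0.9, 0.8, 0.5, 0.25]
is_response = [False, False, True, True]
loss = sft_loss(token_probs, is_response)

if __name__ == "__main__":
    print(f"SFT loss: {loss:.4f}")
```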

Second Stage: GRPO Reinforcement Learning Optimization

The second stage applies Group Relative Policy Optimization (GRPO) for reinforcement learning fine-tuning. GRPO samples a group of responses for each query and scores every response relative to the others in its group. This removes the need for the separate critic (value) model that PPO trains, which is a common source of training instability. This stage not only improves prediction accuracy but also steers the model toward biologically meaningful, interpretable reasoning, so performance and interpretability are optimized together.
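
The group-relative signal at the heart of GRPO can be sketched in a few lines: each sampled response's reward is normalized by the mean and standard deviation of its own group, and the result serves as the advantage in the policy update. The reward values below are hypothetical, and the use of the population standard deviation with a small `eps` is an implementation choice assumed for this sketch; the post does not give AROMA's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    The group mean acts as the baseline, so no learned critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same query,
# e.g. 1.0 if the predicted perturbation effect was correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

if __name__ == "__main__":
    print(advantages)
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient, which is why a reward that varies across samples matters for this method.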


Section 05

Experimental Validation and Open-Source Contributions

AROMA is fine-tuned from the Qwen3-8B base model, building on the language understanding and generation capabilities of open-source large language models. The research team has released the following on the Hugging Face platform:

  • Model Weights: blazerye/AROMA;
  • Reasoning Dataset: blazerye/PerturbReason (full version);
  • Knowledge Graphs: Complete versions of Gene-KG and Path-KG.

This comprehensive open-source release lowers the barrier to reproduction and provides useful infrastructure for the computational biology community.
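
A minimal sketch for fetching the two repositories named above with `snapshot_download` from the `huggingface_hub` package. It assumes that `blazerye/AROMA` is hosted as a model repository and `blazerye/PerturbReason` as a dataset repository; the `Artifact` class and `download_all` helper are hypothetical names. The knowledge graphs are left out because the post gives no repository ID for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    repo_id: str
    repo_type: str  # "model" or "dataset" (assumed hosting type)


ARTIFACTS = [
    Artifact("blazerye/AROMA", "model"),
    Artifact("blazerye/PerturbReason", "dataset"),
]


def download_all(artifacts, cache_dir=None):
    """Download each repository and return {repo_id: local_path}.

    Requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so the module loads even without the package installed.
    from huggingface_hub import snapshot_download

    return {
        a.repo_id: snapshot_download(
            repo_id=a.repo_id, repo_type=a.repo_type, cache_dir=cache_dir
        )
        for a in artifacts
    }


if __name__ == "__main__":
    for repo_id, path in download_all(ARTIFACTS).items():
        print(repo_id, "->", path)
```

Downloading the raw snapshot avoids assuming a particular loading class for the model, since the post does not say how the multimodal weights are packaged.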

Section 06

Technical Significance and Future Outlook

Technical Significance

AROMA offers three takeaways for the AI for Science field:

  1. New Paradigm for Multimodal Fusion: It demonstrates unified modeling of text, graph structure, and sequence data, an approach that could extend to fields such as materials science and drug discovery;
  2. Practical Path for Interpretable AI: Explicit evidence retrieval and structured knowledge integration offer a feasible route to interpretable prediction in scientific domains;
  3. Application of Reinforcement Learning in Scientific Reasoning: Applying GRPO to biological reasoning tasks extends reinforcement learning fine-tuning, of the kind used in RLHF/RLAIF, into specialized professional fields.

Future Outlook

As single-cell sequencing becomes widespread and spatial transcriptomics matures, virtual cell modeling is expected to incorporate finer-grained cell state information. The AROMA architecture is extensible and could integrate emerging data modalities such as single-cell expression profiles and spatial location information, moving toward the long-term goal of 'digital twin cells'.