Using Cell Type-Aware Deep Learning to Predict Gene Enhancer Activity: Cross-Cell Modeling from Sequence to Function

This article introduces a deep learning study combining convolutional neural networks (CNNs), attention mechanisms, and cell type embeddings, exploring how to directly predict regulatory element activity from DNA sequences and evaluating the gain in prediction performance from cell type information.

Deep learning, bioinformatics, gene regulation, convolutional neural networks, attention mechanisms, cell type embeddings, ENCODE, MPRA

Published 2026-06-14 16:14 | Recent activity 2026-06-14 16:22 | Estimated read 10 min

Section 01

[Introduction] Study on Predicting Gene Enhancer Activity Using Cell Type-Aware Deep Learning

Core Research Content

This article introduces a deep learning study that combines convolutional neural networks (CNNs), attention mechanisms, and cell type embeddings. The study aims to predict gene enhancer activity directly from DNA sequence and to measure how much cell type information improves prediction performance.

Research Sources

Original Author: Wojciech Laskowski
Source Platform: GitHub
Project Link: cell-type-aware-enhancer-prediction
Publication Date: June 14, 2026

Section 02

Research Background: Decoding the Black Box of Gene Regulation

In genomics, enhancers are regulatory DNA sequences that increase the transcription of specific genes. Predicting their activity across different cell types remains a core challenge in computational biology.

Experimental methods such as massively parallel reporter assays (MPRAs) measure activity accurately but are costly and limited in scale. In recent years, deep learning, especially CNNs, has offered new ways to predict regulatory activity from sequence, but two key questions remain:

Can regulatory activity be directly predicted from DNA sequences?
Does introducing cell type-specific information improve prediction accuracy?

Section 03

Data Source: ENCODE Phase IV MPRA Experiments

The study's data comes from the ENCODE Project Phase IV MPRA experiments. The constructed joint library includes candidate enhancers, partial promoters, and control sequences. Activity was measured in three human cell lines:

HepG2: Liver cancer cell line (for liver regulation research)
K562: Chronic myeloid leukemia cell line (blood regulation model)
WTC11: Induced pluripotent stem cell-derived cardiomyocytes (developing heart tissue)

Each sample consists of a one-hot encoded DNA sequence, a cell type identifier, and a log2(RNA/DNA) activity value. The dataset is split into training (70%), validation (15%), and test (15%) sets by regulatory element ID, so that no element appears in more than one subset.
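The encoding and the split by element ID can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the field name `element_id`, the function names, the A/C/G/T column order, and the fixed seed are all hypothetical. The point it shows is that whole elements, not individual measurements, are assigned to a subset, so the same sequence measured in several cell lines stays in one subset.

```python
import random

BASES = "ACGT"  # assumed column order for the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def split_by_element(samples, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole regulatory elements to train/val/test.

    `samples` is a list of dicts with a hypothetical "element_id" key.
    Every sample that shares an element ID lands in the same subset.
    """
    element_ids = sorted({s["element_id"] for s in samples})
    random.Random(seed).shuffle(element_ids)

    n_train = round(fractions[0] * len(element_ids))
    n_val = round(fractions[1] * len(element_ids))

    subset_of = {}
    for i, eid in enumerate(element_ids):
        if i < n_train:
            subset_of[eid] = "train"
        elif i < n_train + n_val:
            subset_of[eid] = "val"
        else:
            subset_of[eid] = "test"

    splits = {"train": [], "val": [], "test": []}
    for s in samples:
        splits[subset_of[s["element_id"]]].append(s)
    return splits


# Toy data: 20 elements, each measured in three cell lines.
samples = [
    {"element_id": f"elem{i}", "cell_type": cell}
    for i in range(20)
    for cell in ("HepG2", "K562", "WTC11")
]
splits = split_by_element(samples)
```

Splitting on the sample level instead would let the same sequence appear in both training and test data through different cell lines, which would inflate test scores.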

Section 04

Model Architecture: Systematic Comparison of Four CNN Variants

The study designed four CNN variants to evaluate the impact of cell type information:

Baseline CNN: Uses only DNA sequence input, with multi-layer convolutional pooling to extract features, serving as the control group.
Embedding CNN: Adds learnable cell type embedding vectors to the baseline and concatenates them with the sequence features before the regression layer, so the model can capture interactions between cell type and sequence.
Attention CNN: A sequence-only model that replaces global average pooling with attention pooling to identify key regions in the sequence.
Full CNN: The complete architecture combining cell type embeddings and attention pooling.
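The difference between the four variants comes down to two switches: how position-wise convolutional features are pooled, and whether a cell type embedding is concatenated before the regression layer. The framework-free sketch below illustrates that idea only. The source does not specify the attention form, so the single-query softmax pooling, the function names, and the `VARIANTS` table are assumptions; in a real model the query and embeddings would be learned parameters.

```python
import math


def global_average_pool(features):
    """Average L position-wise feature vectors (each of size C) into one vector."""
    length = len(features)
    return [sum(column) / length for column in zip(*features)]


def attention_pool(features, query):
    """Weight each position by softmax(features[i] . query), then sum.

    Returns the pooled vector and the per-position attention weights.
    """
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    channels = len(features[0])
    pooled = [sum(w * vec[c] for w, vec in zip(weights, features)) for c in range(channels)]
    return pooled, weights


# Hypothetical mapping of variant names to the two switches.
VARIANTS = {
    "baseline": {"attention": False, "embedding": False},
    "embedding": {"attention": False, "embedding": True},
    "attention": {"attention": True, "embedding": False},
    "full": {"attention": True, "embedding": True},
}


def head_input(features, cell_type, variant, query, embeddings):
    """Build the vector that would be passed to the regression layer."""
    config = VARIANTS[variant]
    if config["attention"]:
        pooled, _ = attention_pool(features, query)
    else:
        pooled = global_average_pool(features)
    if config["embedding"]:
        # Concatenate sequence features with the cell type embedding.
        return pooled + list(embeddings[cell_type])
    return pooled


# Toy example: 2 positions, 2 channels, a 2-dimensional embedding per cell type.
features = [[1.0, 0.0], [3.0, 2.0]]
embeddings = {"HepG2": [0.3, -0.1], "K562": [0.1, 0.2], "WTC11": [-0.2, 0.4]}
full_vector = head_input(features, "K562", "full", [0.0, 0.0], embeddings)
```

With an all-zero query the attention weights are uniform, so attention pooling reduces to average pooling; a learned query lets the model up-weight specific positions.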

Section 05

Experimental Results: Cell Type Embedding Model Performs Best

Evaluation metrics include MSE, MAE, RMSE, Pearson correlation coefficient, Spearman correlation coefficient, and R².

Overall Performance: The embedding model performed best, with an overall test-set Pearson correlation of 0.489 and R² of 0.239. The attention model did not outperform the baseline, and the full model performed similarly to the embedding model.

Cell Type Specificity: Predictions were most accurate for the WTC11 cell line (Pearson 0.523, R² 0.271), possibly reflecting the characteristics of the cardiomyocyte regulatory network or differences in data quality.

Section 06

Key Findings: Value of Cell Type Embeddings and Limitations of Attention Mechanisms

Key Findings

Value of Cell Type Embeddings: Explicitly introducing cell type information can improve cross-cell prediction performance, capturing regulatory patterns that traditional sequence models struggle to learn.
Limitations of Attention Mechanisms: Attention pooling on its own did not improve performance. Possible reasons are that enhancer activity is determined by multiple scattered elements, that the current attention design does not capture long-range dependencies, or that the task is better suited to global aggregation.
Underestimation of High-Activity Elements: The model tends to underestimate high-activity elements, possibly due to sparse extreme values in training data, loss function penalizing outliers, or insufficient model capacity.
Cell Type Performance Differences: Prediction accuracy varies across cell types, possibly related to the diversity of regulatory factors, sample distribution, or evolutionary conservation.
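One way to check for the underestimation of high-activity elements described above is to group test samples into bins by measured activity and look at the mean residual (prediction minus measurement) in each bin. The sketch below is an illustrative diagnostic, not part of the project as far as the source states; the function name and the equal-size binning are assumptions.

```python
def mean_residual_by_bin(y_true, y_pred, n_bins=4):
    """Mean of (prediction - measurement) within equal-size bins of measured activity.

    Bins are ordered from lowest to highest measured activity. A negative
    mean residual in the last bin indicates underestimation of the most
    active elements.
    """
    order = sorted(range(len(y_true)), key=lambda i: y_true[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        indices = order[start:stop]
        if not indices:
            continue  # fewer samples than bins
        residuals = [y_pred[i] - y_true[i] for i in indices]
        bins.append(
            {
                "min_true": y_true[indices[0]],
                "max_true": y_true[indices[-1]],
                "mean_residual": sum(residuals) / len(residuals),
            }
        )
    return bins


# Toy predictions that are compressed toward the middle of the range.
y_true = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y_pred = [1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 5.5, 6.0]
bins = mean_residual_by_bin(y_true, y_pred)
```

Running the same diagnostic separately per cell type would show whether the bias is shared across HepG2, K562, and WTC11 or concentrated in one of them.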

Section 07

Technical Implementation: Reproducible Snakemake Workflow

The project uses a Snakemake workflow to ensure reproducibility:

Execute the full workflow: snakemake --cores 1 (includes data preprocessing, model training, evaluation, and visualization)
Train a single model variant: python src/train.py --variant [model variant] --epochs 30 (available variants: baseline, embedding, attention, full)
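The command-line interface above could be parsed along the following lines. This is a hypothetical sketch of what such an entry point might look like, not the contents of the project's `src/train.py`: only the `--variant` and `--epochs` flags and the four variant names come from the source, while the default of 30 epochs, the help texts, and the `build_parser` function are assumptions.

```python
import argparse

# Variant names as listed in the article.
VARIANTS = ("baseline", "embedding", "attention", "full")


def build_parser():
    """Create a parser accepting the two flags shown in the article."""
    parser = argparse.ArgumentParser(
        description="Train one CNN variant (illustrative sketch)."
    )
    parser.add_argument(
        "--variant",
        choices=VARIANTS,
        required=True,
        help="which of the four architectures to train",
    )
    parser.add_argument(
        "--epochs",
        type=int,
        default=30,  # assumed default; the article only shows 30 as an example
        help="number of training epochs",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--variant", "full", "--epochs", "30"])
print(args.variant, args.epochs)
```

Restricting `--variant` with `choices` makes a mistyped variant name fail immediately rather than partway through a training run.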

The complete code and documentation are open-sourced, facilitating reproducibility or extension of the research.

Section 08

Research Significance and Future Directions

Research Significance

This study provides important insights for genomic regulation prediction:

Cross-cell regulation prediction requires explicit modeling of cell specificity rather than training a general model with mixed data.
Deep learning components need to be designed with biological properties in mind; transferring attention mechanisms to genomics tasks is not straightforward.
Data quality and experimental design affect model performance, requiring a balance in cross-cell data distribution.
Predicting high-activity elements is a common challenge that requires specialized loss functions or data augmentation strategies.

Conclusion

Cell type-aware models are a useful step for deep learning in genomics. This study supports the effectiveness of cell type embeddings and shows the limits of attention pooling for this task. The results provide a basis for building more accurate regulatory models and for accelerating functional genomics research.

Project code and resources are available on GitHub.