Zing Forum

Epi-PRS: Precise Polygenic Risk Prediction Using Genomic Large Language Models

This article introduces the Epi-PRS project, an innovative polygenic risk scoring method that converts individual DNA sequences into personalized genomic and epigenomic features using genomic large language models, providing new insights for disease risk modeling.

Tags: polygenic risk score, genomics, large language models, Enformer, precision medicine, disease risk prediction, deep learning, GWAS
Published 2026-06-17 02:11 | Recent activity 2026-06-17 02:25 | Estimated read 6 min

Section 01

Epi-PRS: Precise Polygenic Risk Prediction Using Genomic Large Language Models (Introduction)

Original Author/Maintainer: SUwonglab
Source Platform: GitHub
Original Link: https://github.com/SUwonglab/Epi-PRS
Publication Time: 2026-06-16T18:11:51Z

Epi-PRS is a polygenic risk scoring (PRS) method that uses genomic large language models, such as DeepMind's Enformer, to convert an individual's DNA sequence into personalized genomic and epigenomic features. It targets two limitations of traditional PRS methods, their reliance on statistical associations and their neglect of gene regulatory mechanisms, and offers a new approach to disease risk modeling.


Section 02

Project Background: Limitations of Traditional PRS Methods

Traditional PRS methods are built from genome-wide association study (GWAS) summary statistics and have the following limitations:

  1. Linear assumption: Gene-gene interactions and non-linear effects are ignored;
  2. Lack of functional annotation: Information from non-coding regulatory variants is difficult to use;
  3. Population bias: Prediction accuracy is lower in non-European populations.

Epi-PRS attempts to mitigate these issues through deep learning models.
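
For reference, the linear assumption in item 1 comes from how a traditional score is usually computed: a weighted sum of allele counts, with one GWAS effect size per variant. The sketch below illustrates that general formula only. The function name, effect sizes, and dosages are hypothetical values chosen for illustration and are not taken from the Epi-PRS repository.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Classic additive PRS: sum over variants of (effect size * allele dosage)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Hypothetical per-variant effect sizes (e.g. log odds ratios from GWAS summary statistics).
effect_sizes = [0.12, -0.05, 0.30]
# One individual's effect-allele counts (0, 1, or 2) at the same variants.
dosages = [2, 1, 0]

score = polygenic_risk_score(dosages, effect_sizes)
print(f"PRS = {score:.2f}")
```

Because each variant contributes independently and proportionally to its dosage, a score of this form cannot represent interactions between variants, which is the gap the list above describes.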

Section 03

Core Technologies and Method Workflow

Core Technology: Enformer Model

Enformer is a Transformer-based model developed by DeepMind that predicts molecular phenotypes, such as gene expression and chromatin state, from DNA sequence. Its features include:

  • Accepts sequence input of up to 196,608 base pairs;
  • Multi-task prediction of 5,313 molecular phenotypes;
  • Captures long-range sequence dependencies.
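
As a rough illustration of what "sequence input" means in practice, the sketch below picks a 196,608 bp window centered on a position of interest and one-hot encodes a DNA string. It is a minimal stdlib-only sketch: the helper names, the A/C/G/T channel order, the all-zero encoding for unknown bases, and the example coordinate are assumptions made here, not details confirmed by the Epi-PRS repository.

```python
ENFORMER_INPUT_LENGTH = 196_608  # input window size stated in the list above

# Channel order A, C, G, T is an assumption for this sketch.
ONE_HOT = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
}
UNKNOWN = (0.0, 0.0, 0.0, 0.0)  # used for N or any other symbol


def centered_window(center, length=ENFORMER_INPUT_LENGTH):
    """Return the half-open interval [start, end) of `length` bp around `center`."""
    start = center - length // 2
    return start, start + length


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [ONE_HOT.get(base, UNKNOWN) for base in sequence.upper()]


# Hypothetical position of interest on some chromosome.
start, end = centered_window(center=1_000_000)
encoded = one_hot_encode("ACGTN")
print(start, end, len(encoded))
```

The window helper does not clamp to chromosome boundaries, so a real pipeline would need to pad or shift windows near chromosome ends.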

Epi-PRS Workflow:

  1. Individual genomic feature extraction: Extract Enformer features from DNA sequences of target regions;
  2. Epigenomic feature engineering: Cross-cell type aggregation, functional region weighting, dimensionality reduction;
  3. Risk prediction model training: Fit a linear model, an elastic net, or gradient-boosted trees on the engineered features.
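
Step 2 of the workflow can be pictured with a small example. The sketch below averages predicted track signals across cell types for each assay and then applies an optional per-group weight. It shows one plausible reading of "cross-cell type aggregation" and "functional region weighting", not the project's actual implementation. The function name, track names, grouping, and weights are all hypothetical, and dimensionality reduction is left out.

```python
from statistics import mean


def aggregate_tracks(track_values, track_groups, group_weights=None):
    """Average track predictions within each group, then apply a per-group weight.

    track_values:  {track_name: predicted signal for one region}
    track_groups:  {group_name: [track_name, ...]}, e.g. one assay across cell types
    group_weights: optional {group_name: weight}; missing groups default to 1.0
    """
    group_weights = group_weights or {}
    features = {}
    for group, tracks in track_groups.items():
        avg = mean(track_values[t] for t in tracks)
        features[group] = group_weights.get(group, 1.0) * avg
    return features


# Hypothetical model outputs for one genomic region in one individual.
track_values = {
    "DNase:breast": 0.8,
    "DNase:liver": 0.2,
    "H3K27ac:breast": 0.6,
    "H3K27ac:liver": 0.4,
}
track_groups = {
    "DNase": ["DNase:breast", "DNase:liver"],
    "H3K27ac": ["H3K27ac:breast", "H3K27ac:liver"],
}

features = aggregate_tracks(track_values, track_groups, group_weights={"DNase": 2.0})
print(features)
```

The resulting per-individual feature dictionary is the kind of input that step 3 would pass to a linear model, elastic net, or tree ensemble.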

Section 04

Technical Advantages and Application Scenarios

Technical Advantages:

  • Biological interpretability: Features correspond to clear molecular phenotypes;
  • Utilizes non-coding variants: Covers the non-coding regions that make up roughly 98% of the genome;
  • Integrates rare variants: Learns from complete sequences;
  • Cross-population generalization: Trained on diverse data.

Application Scenarios:

  • Disease risk stratification: E.g., early screening for individuals at high risk of breast cancer;
  • Pharmacogenomics: Guides personalized medication;
  • Complex disease research: Identifies risk genes and regulatory pathways.

The project repository includes an example of breast cancer risk prediction.

Section 05

Limitations and Challenges

Epi-PRS faces the following challenges:

  1. High computational cost: Enformer inference requires substantial compute resources;
  2. High feature dimensionality: With many more features than samples, models are prone to overfitting;
  3. Causal inference problem: The method identifies statistical associations rather than causal effects;
  4. Model update requirement: Reference genome and cell type information must be refreshed as new data become available.
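
To give a sense of scale for challenge 2, the back-of-the-envelope calculation below counts raw model outputs per genomic region. The bin size and bin count come from the published Enformer design rather than from the Epi-PRS repository. Treating each region as two haplotypes is an assumption made here for illustration, and the function name is hypothetical.

```python
INPUT_BP = 196_608   # Enformer input window, as described earlier in this article
NUM_TRACKS = 5_313   # molecular phenotypes predicted per output bin

# Assumptions taken from the published Enformer design, not from the Epi-PRS repo:
BIN_SIZE_BP = 128    # output resolution in base pairs
NUM_BINS = 896       # output bins covering the central part of the input window


def raw_feature_count(num_regions, haplotypes=2):
    """Number of raw predicted values before any aggregation or reduction."""
    return num_regions * haplotypes * NUM_BINS * NUM_TRACKS


covered_bp = NUM_BINS * BIN_SIZE_BP            # span of the window with predictions
per_region = raw_feature_count(num_regions=1)  # values for one region, both haplotypes

print(f"{covered_bp:,} of {INPUT_BP:,} bp receive predictions")
print(f"{per_region:,} raw values per region")
```

With millions of raw values per region and cohorts that are typically far smaller, aggregation and dimensionality reduction become necessary before a risk model can be fitted.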

Section 06

Future Development Directions

Future directions of Epi-PRS:

  1. More powerful base models: Support longer sequences and more prediction tasks;
  2. Multi-omics integration: Combine transcriptome, proteome, and other data;
  3. Causal inference methods: Distinguish between correlation and causation;
  4. Clinical translation research: Validate clinical utility and cost-effectiveness.

Section 07

Conclusion

Epi-PRS combines deep learning with genomics, opening a new path for polygenic risk prediction. It aims not only to improve prediction accuracy but also to deepen understanding of the genetic mechanisms of disease. As data accumulate and computing power improves, such methods could play an important role in precision medicine, giving researchers and clinicians more accurate risk assessment tools.