Epi-PRS: Precise Polygenic Disease Risk Prediction Using Genomic Large Language Models

Epi-PRS, a method developed by a Stanford University team, applies genomic large language models such as Enformer to polygenic risk scoring. It extracts functional features from each individual's genome in order to predict disease risk more precisely.

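Polygenic risk score, genomic large language model, Enformer, disease prediction, precision medicine, epigenetics, Stanford University, transfer learning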
Published 2026-06-17 02:11 | Recent activity 2026-06-17 02:19 | Estimated read 6 min

Section 01

[Introduction] Epi-PRS: A New Method for Precise Polygenic Disease Risk Prediction Driven by Genomic Large Language Models

Epi-PRS, developed by the Wong Lab at Stanford University, applies genomic large language models such as Enformer to polygenic risk scores (PRS). Traditional PRS relies on statistical associations and neglects functional context; Epi-PRS addresses these limitations by extracting functional features from individual genomes, with the goal of more precise disease risk prediction. In doing so, the method brings biological knowledge into risk prediction and offers a new tool for precision medicine.

Section 02

Research Background and Challenges of Traditional PRS

The polygenic risk score (PRS) is a core tool for assessing genetic susceptibility to complex diseases, but traditional PRS has several limitations: it relies only on statistical associations and ignores the functional context of genetic variants (e.g., gene expression, epigenetic regulation); the mechanisms of non-coding variants are difficult to interpret; prediction accuracy varies widely across populations; and it cannot capture the complexity of gene regulatory networks.
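
For reference, the traditional score that these limitations apply to is usually an additive sum of per-variant effect sizes weighted by allele dosage. The sketch below shows that standard formulation only; the function name and the toy numbers are illustrative assumptions and do not come from the Epi-PRS project.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Classical additive PRS: sum over variants of effect size x allele dosage."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * dosage for beta, dosage in zip(effect_sizes, dosages))


# Hypothetical toy values for three variants.
# Each dosage is the number of copies (0, 1, or 2) of the effect allele.
effect_sizes = [0.5, -0.25, 1.0]  # per-allele effects, e.g. GWAS log odds ratios
dosages = [2, 1, 0]

score = polygenic_risk_score(dosages, effect_sizes)
print(score)
```

Each variant contributes independently here, so the score has no notion of which gene or regulatory element a variant affects.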

Section 03

Core Innovations of Epi-PRS

The core of Epi-PRS is the use of genomic large language models (gLLMs) to extract functional features from individual genomes. The human genome can be read as a structured "language", and gLLMs such as Enformer have been trained on large-scale data to predict molecular phenotypes, including gene expression and chromatin accessibility, directly from DNA sequence. Epi-PRS converts raw DNA sequences into high-dimensional functional features, which brings biological knowledge into risk prediction.
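
Sequence models of this kind typically do not consume DNA as text; the sequence is first one-hot encoded into a numeric matrix. The sketch below illustrates that encoding step only. The function name, the A/C/G/T channel order, and the all-zero row for unknown bases are assumptions based on common convention, not details confirmed by the Epi-PRS source.

```python
BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Return one row of four 0/1 values (A, C, G, T) per base.

    Any other symbol (for example N) is encoded as an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


# Toy sequence; real model inputs are far longer.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```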

Section 04

Technical Implementation Process of Epi-PRS

Epi-PRS proceeds in three stages:
1. Individual genome construction: remove indels from the VCF files so that only SNPs remain, then phase the genotypes to construct paternal and maternal haplotypes.
2. Feature extraction: run Enformer on the haplotype sequences and extract molecular features across cell lines and tissues (e.g., gene expression, chromatin accessibility).
3. Risk modeling: after PCA dimensionality reduction, fit logistic regression or an elastic net to calculate risk scores, using an 80/20 training/test split.
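
The first stage can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's own script: the genotypes are already phased (`0|1` style), there is a single sample column, all records lie on one chromosome, and the helper names `is_snp` and `build_haplotypes` are hypothetical.

```python
def is_snp(ref, alt):
    """True for a biallelic single-nucleotide substitution (no indels, no multi-allelic sites)."""
    return len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"


def build_haplotypes(reference, vcf_lines):
    """Apply phased SNPs to a reference sequence and return the two haplotype strings."""
    hap_a = list(reference)
    hap_b = list(reference)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        genotype = fields[9].split(":")[0]  # GT is the first FORMAT key when present
        if not is_snp(ref, alt) or "|" not in genotype:
            continue  # drop indels, multi-allelic sites, and unphased calls
        index = pos - 1  # VCF positions are 1-based
        if reference[index] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        allele_a, allele_b = genotype.split("|")
        if allele_a == "1":
            hap_a[index] = alt
        if allele_b == "1":
            hap_b[index] = alt
    return "".join(hap_a), "".join(hap_b)


# Toy data: one indel (skipped) and two phased SNPs.
reference = "ACGTACGT"
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t2\t.\tC\tT\t.\tPASS\t.\tGT\t0|1",
    "chr1\t4\t.\tT\tTA\t.\tPASS\t.\tGT\t1|1",
    "chr1\t7\t.\tG\tA\t.\tPASS\t.\tGT\t1|0",
]
hap_a, hap_b = build_haplotypes(reference, vcf_lines)
print(hap_a)
print(hap_b)
```

Because only substitutions are applied, both haplotypes keep the reference length and coordinates. Note that statistical phasing alone separates the two haplotypes without establishing which one is paternal and which is maternal.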

Section 05

Advantages and Potential Impact of Epi-PRS

The advantages of Epi-PRS include:
1. Transfer learning: the regulatory rules learned by pre-trained gLLMs carry over to new tasks, so the method performs well even with limited samples.
2. Cross-population generalization: because it builds on functional genomics principles, it reduces Eurocentric bias.
3. Interpretability: the features derive from well-defined molecular phenotypes, so regulatory mechanisms can be traced and drug target discovery can be guided.
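
The transfer-learning point can be made concrete: the pre-trained model stays frozen, and only a small downstream model is fit on its outputs, as in the risk-modeling stage (PCA followed by logistic regression with an 80/20 split). The sketch below is an illustration under stated assumptions. Random numbers stand in for Enformer-derived features, the classifier is a plain gradient-descent logistic regression with an L2 penalty rather than an elastic net, and all function names and hyperparameters are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Fit PCA via SVD; return the feature means and the top principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project centered data onto the principal axes."""
    return (X - mean) @ components.T


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, lr=0.1, n_steps=500, l2=0.01):
    """Batch gradient descent on the L2-penalized logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


# Synthetic stand-in data: 200 individuals, 50 features each.
rng = np.random.default_rng(0)
n_individuals, n_features = 200, 50
features = rng.normal(size=(n_individuals, n_features))
labels = (features[:, 0] + 0.5 * rng.normal(size=n_individuals) > 0).astype(float)

# 80/20 training/test split.
order = rng.permutation(n_individuals)
n_train = int(round(0.8 * n_individuals))
train_idx, test_idx = order[:n_train], order[n_train:]

# PCA is fit on the training set only, then applied to both sets.
mean, components = pca_fit(features[train_idx], n_components=10)
Z_train = pca_transform(features[train_idx], mean, components)
Z_test = pca_transform(features[test_idx], mean, components)

w, b = fit_logistic_regression(Z_train, labels[train_idx])
risk = sigmoid(Z_test @ w + b)  # predicted risk per held-out individual
print(risk[:5])
```

Fitting the PCA on the training individuals only keeps information from the held-out set out of the model.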

Section 06

Technical Dependencies and Usage Requirements

Epi-PRS depends on Python 3.9, TensorFlow 2.8, TensorFlow Hub 0.11, and Java JDK 1.8, and Enformer inference requires substantial computing resources. Users need to prepare VCF genotype data, a reference genome, and phenotype labels, and some bioinformatics experience is needed. The project repository provides step-by-step instructions and example scripts.
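
Before running the pipeline, it can help to confirm that the local environment matches the versions listed above. The following is a small, hypothetical check written for this article and is not part of the project repository. It assumes the usual PyPI distribution names `tensorflow` and `tensorflow-hub`, treats the stated versions as exact major.minor matches, and does not check the Java JDK.

```python
import sys
from importlib import metadata

# Versions as stated in the text; distribution names are assumed PyPI names.
EXPECTED_PYTHON = (3, 9)
EXPECTED_PACKAGES = {"tensorflow": "2.8", "tensorflow-hub": "0.11"}


def matches_prefix(installed, expected):
    """True if the installed version begins with the expected version components."""
    expected_parts = expected.split(".")
    return installed.split(".")[: len(expected_parts)] == expected_parts


def check_environment():
    """Return a dict mapping each requirement to True (matches) or False."""
    report = {"python": tuple(sys.version_info[:2]) == EXPECTED_PYTHON}
    for name, expected in EXPECTED_PACKAGES.items():
        try:
            report[name] = matches_prefix(metadata.version(name), expected)
        except metadata.PackageNotFoundError:
            report[name] = False  # package is not installed
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or different version'}")
```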

Section 07

Limitations and Future Directions

Epi-PRS has several limitations: it currently relies on the Enformer model, and newer models remain to be explored; the strategy for integrating paternal and maternal genome information could be improved; and large-scale clinical validation is still required. Future directions include trying newer gLLMs, using more expressive architectures (e.g., graph neural networks, GNNs) to capture allele interactions, and evaluating clinical translation.

Section 08

Conclusion: The Potential of AI and Genomics Integration

Epi-PRS demonstrates the value of deep integration between AI and genomics. Beyond improving prediction accuracy, it opens new avenues for understanding the molecular mechanisms of disease. As gLLMs evolve and computing resources become more widely available, such methods are expected to find use in more areas of precision medicine and to benefit a wider range of patient populations.