Zing Forum

Forensic Ancestry Inference: A Benchmark Study Based on SNP Panels and Machine Learning

This study explores how to use five ancestry-informative SNP markers and machine learning algorithms to accurately infer continental-level ancestry from degraded DNA samples, providing proof of concept for forensic applications.

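Forensic genetics, ancestry inference, SNP, machine learning, population genetics, 1000 Genomes Project, DNA degradation, classification algorithms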
Published 2026-06-11 06:16 | Recent activity 2026-06-11 06:20 | Estimated read 4 min

Section 01

Introduction: Proof of Concept Study on Forensic Ancestry Inference

This study explores the use of five ancestry-informative SNP markers and machine learning algorithms to accurately infer continental-level ancestry from degraded DNA samples. Using 1000 Genomes Project data, it validates the feasibility of a minimal SNP panel and provides a proof of concept for forensic applications.

Section 02

Background: DNA Challenges in Forensic Science and Solutions with AISNPs

In forensic practice, DNA samples from crime scenes are often low in quantity and degraded. Traditional STR analysis is powerful for individual identification but carries limited ancestry information. Ancestry-informative SNPs (AISNPs) differ markedly in allele frequency among continental populations, so ancestry can be inferred from a small number of markers. This study explores how far the SNP panel can be reduced.

Section 03

Study Design: Data Sources and Five-Marker AISNP Panel

Data were obtained from 2,504 individuals in Phase 3 of the 1000 Genomes Project, grouped into five continental populations: AFR (Africa), AMR (Admixed Americas), EAS (East Asia), EUR (Europe), and SAS (South Asia). The panel consists of five AISNP markers previously validated in population genetics, including rs2814778 (informative for African ancestry), rs3827760 (informative for East Asian ancestry), etc.

Section 04

Analysis Methods: Genotype Analysis and Machine Learning Classification

The allele frequency of each SNP was calculated per population (e.g., rs2814778 is highly specific to African populations). PCA on the five markers captured 80.3% of the variance and showed clear population clustering. Four machine learning models were evaluated; SVM achieved the highest accuracy (91.2%), although accuracy for the Admixed Americas population was lower.
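As a rough illustration of the frequency step, the sketch below computes per-population alternate-allele frequencies from genotypes coded as 0, 1, or 2 alternate-allele copies. The function name `allele_frequencies`, the 0/1/2 coding, and the toy genotypes are assumptions made for this example; they are not taken from the study or from 1000 Genomes data.

```python
from collections import defaultdict


def allele_frequencies(genotypes, populations):
    """Return {population: [alternate-allele frequency per SNP]}.

    genotypes:   one list per individual; each entry is the alternate-allele
                 count (0, 1 or 2) or None when the call is missing.
    populations: one population label per individual.
    """
    n_snps = len(genotypes[0])
    alt_counts = defaultdict(lambda: [0] * n_snps)
    allele_totals = defaultdict(lambda: [0] * n_snps)
    for row, pop in zip(genotypes, populations):
        counts, totals = alt_counts[pop], allele_totals[pop]
        for j, g in enumerate(row):
            if g is None:
                continue  # missing calls contribute no alleles
            counts[j] += g
            totals[j] += 2  # each diploid call contributes two alleles
    return {
        pop: [c / t if t else float("nan")
              for c, t in zip(alt_counts[pop], allele_totals[pop])]
        for pop in alt_counts
    }


# Toy genotypes for two hypothetical markers (illustrative values only).
toy_genotypes = [
    [2, 0], [2, 0], [1, 0],     # three AFR individuals
    [0, 2], [0, 1], [0, None],  # three EAS individuals, one missing call
]
toy_populations = ["AFR", "AFR", "AFR", "EAS", "EAS", "EAS"]

freqs = allele_frequencies(toy_genotypes, toy_populations)
for pop, values in freqs.items():
    print(pop, [round(v, 3) for v in values])
```

A marker is useful for ancestry inference when these per-population frequencies differ sharply, which is the pattern the text describes for rs2814778.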

Section 05

Key Findings: Feature Importance and Degradation Robustness

Random forest feature importance ranked rs2814778 as the most informative marker. In progressive SNP deletion experiments, where markers are removed one at a time to mimic dropout in degraded samples, classification performance remained robust under moderate deletion, highlighting the forensic value of high-information markers.
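The deletion experiment can be sketched in a few lines. Note that the example below does not use the study's SVM or random forest: it swaps in a simple likelihood classifier that assumes Hardy-Weinberg proportions and independent markers, because that classifier can skip missing markers without retraining. The population names, allele frequencies, and function names are all hypothetical.

```python
import math
import random

# Hypothetical alternate-allele frequencies for three toy markers
# (illustrative values only, not from the study or 1000 Genomes).
TOY_FREQS = {
    "POP_A": [0.95, 0.05, 0.50],
    "POP_B": [0.05, 0.90, 0.50],
}


def genotype_prob(g, p):
    """Hardy-Weinberg probability of alternate-allele count g given frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}[g]


def classify(genotype, freqs, eps=1e-6):
    """Return the population with the highest log-likelihood; None entries are skipped."""
    best_pop, best_ll = None, -math.inf
    for pop, ps in freqs.items():
        ll = 0.0
        for g, p in zip(genotype, ps):
            if g is None:
                continue  # dropped marker: no evidence either way
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            ll += math.log(genotype_prob(g, p))
        if ll > best_ll:
            best_pop, best_ll = pop, ll
    return best_pop


def sample_genotype(ps, rng):
    """Draw one diploid genotype per marker from the given allele frequencies."""
    return [sum(rng.random() < p for _ in range(2)) for p in ps]


def drop_markers(genotype, n_drop, rng):
    """Set n_drop randomly chosen markers to None to mimic allelic dropout."""
    dropped = set(rng.sample(range(len(genotype)), n_drop))
    return [None if j in dropped else g for j, g in enumerate(genotype)]


def accuracy_under_dropout(freqs, n_drop, n_per_pop=200, seed=0):
    """Simulate individuals, drop n_drop markers each, and score the classifier."""
    rng = random.Random(seed)
    correct = total = 0
    for pop, ps in freqs.items():
        for _ in range(n_per_pop):
            observed = drop_markers(sample_genotype(ps, rng), n_drop, rng)
            correct += classify(observed, freqs) == pop
            total += 1
    return correct / total


for n_drop in range(4):
    print(n_drop, "markers dropped ->", accuracy_under_dropout(TOY_FREQS, n_drop))
```

In this toy setup accuracy falls to chance once every marker is removed, and how quickly it degrades before that depends on whether the surviving markers are the informative ones. That is the intuition behind pairing the deletion experiment with feature importance.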

Section 06

Limitations and Future Research Directions

Limitations: only five markers were evaluated, and subcontinental population structure was not addressed. Future work: expand to the Kidd55 panel, evaluate ensemble models, simulate DNA degradation scenarios, validate with independent datasets, etc.

Section 07

Practical Significance: Implications for Forensic Applications

The study shows that a minimal SNP panel can reproduce continental population structure. When samples are limited or degraded, a small number of carefully selected markers can still provide ancestry clues, offering additional support for case investigations.