Zing Forum

DPVT: A New Method for Phylogenetic Inference Based on Tree Traversal Neural Networks

A PyTorch project that combines deep learning with phylogenetics, using neural networks to traverse tree structures and predict edges in maximum parsimony trees, providing innovative ideas for phylogenetic inference in bioinformatics.

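Deep learning, phylogenetics, bioinformatics, PyTorch, neural networks, evolutionary trees, maximum parsimony, computational biology, graph neural networks, Transformer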
Published 2026-05-19 11:43 | Recent activity 2026-05-19 11:53 | Estimated read 7 min

Section 01

[Introduction] DPVT: An Innovative Method for Phylogenetic Inference Empowered by Deep Learning

DPVT (Deep Phylogenetics Via Traversals) is a PyTorch project that combines deep learning with phylogenetics. Its TraverseNN neural network traverses tree structures and predicts which edges belong to maximum parsimony trees. The goal is to reduce the computational cost that traditional maximum parsimony methods incur on large datasets, offering a new approach to phylogenetic inference in bioinformatics.

Section 02

Research Background and Problem Definition

The core task of phylogenetic inference is to reconstruct species evolutionary trees from DNA sequences. The maximum parsimony method prefers the tree that explains the observed sequences with the fewest evolutionary changes, but the search space grows super-exponentially with the number of species: n species (n ≥ 3) correspond to (2n-5)!! unrooted binary trees, so enumerating candidate trees quickly becomes computationally infeasible. DPVT poses a core question: can a neural network be trained to determine which edges are likely to appear in maximum parsimony trees, thereby reducing the search space?
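To make the growth of the search space concrete, here is a minimal sketch that evaluates the double factorial (2n-5)!! for a few values of n. The function name is illustrative and not part of DPVT.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (the factor 1 is implicit).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_binary_trees(n))
```

Four species already give 3 candidate topologies and ten species give 2,027,025, which is why exhaustive enumeration stops being practical very early.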

Section 03

Technical Architecture: TraverseNN Model Design

DPVT implements the TraverseNN module, leveraging the hierarchical characteristics of tree structures:

  1. Data Representation: Two dataset formats are supported. TraversalDataset serializes tree traversals into tensors, supports GPU acceleration, covers both upward and downward traversal directions, and learns node features via an RNN. TreeDataset uses the ete3 library to preserve tree topology and attributes;
  2. Forward Propagation Process:
    • Traversal Learning: Information flows from leaf nodes to the root and back (similar to message passing);
    • Site Aggregation: Uses a Transformer encoder to aggregate cross-site information, and takes the average to get the final node features;
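    • Classification Output: A linear layer followed by a sigmoid outputs a score in [0, 1] for each edge. Under the project's label convention (0 = MP edge, 1 = non-MP edge), a score close to 0 means the edge is predicted to be in the maximum parsimony tree and a score close to 1 means it is not.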
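The three forward-propagation steps can be illustrated with a dependency-free toy. This is only a sketch of the general idea and not the TraverseNN implementation: plain sums stand in for the learned RNN updates, a plain mean over sites stands in for the Transformer encoder, and the tree, features, and weights are all hypothetical.

```python
import math

# Hypothetical toy tree: parent -> list of children; leaves have no entry.
CHILDREN = {"root": ["u", "c"], "u": ["a", "b"]}
# Hypothetical per-site scalar features at each node (2 sites).
FEATS = {"root": [0.0, 0.0], "u": [0.0, 0.0],
         "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}


def upward(node, up):
    """Post-order pass: a node sums its own features and its children's summaries."""
    total = list(FEATS[node])
    for child in CHILDREN.get(node, []):
        upward(child, up)
        total = [t + c for t, c in zip(total, up[child])]
    up[node] = total


def downward(node, up, down):
    """Pre-order pass: a child receives the summary of everything outside its subtree."""
    for child in CHILDREN.get(node, []):
        down[child] = [d + u - c
                       for d, u, c in zip(down[node], up[node], up[child])]
        downward(child, up, down)


def edge_scores(w_up=1.0, w_down=-1.0, bias=0.0):
    """Score the edge above every non-root node with a sigmoid over site-averaged features."""
    up, down = {}, {}
    upward("root", up)
    down["root"] = [0.0] * len(FEATS["root"])
    downward("root", up, down)
    scores = {}
    for node in up:
        if node == "root":
            continue  # the root has no parent edge
        mean_up = sum(up[node]) / len(up[node])        # average over sites
        mean_down = sum(down[node]) / len(down[node])  # average over sites
        z = w_up * mean_up + w_down * mean_down + bias  # stand-in linear layer
        scores[node] = 1.0 / (1.0 + math.exp(-z))
    return scores


if __name__ == "__main__":
    print(edge_scores())
```

Because the child summaries are combined by addition, reordering the entries in CHILDREN leaves every score unchanged.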

Section 04

Key Mechanisms and Training Strategies

  • Mutation Encoding: DNA base mutations are encoded using four-dimensional vectors (e.g., A→T is [-1,1,0,0]) to preserve directionality;
  • Symmetry Handling: The order of child nodes does not affect parent node features, ensuring consistent output when child nodes are swapped;
  • Training Strategy: The data is split 0.8/0.2 into training and validation sets, and positive and negative samples (MP edges vs. non-MP edges) are kept in a balanced ratio so that the classifier is not dominated by one class.
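The mutation encoding and the symmetry property can be sketched in a few lines. The base ordering "ATCG" is an assumption, chosen only because it reproduces the A→T example above; the function names are hypothetical, and the element-wise sum is one simple order-invariant choice, not necessarily the one DPVT uses.

```python
BASES = "ATCG"  # assumed ordering, picked so that A->T gives [-1, 1, 0, 0]


def one_hot(base):
    """One-hot vector for a single DNA base under the assumed ordering."""
    return [1 if b == base else 0 for b in BASES]


def encode_mutation(src, dst):
    """Signed difference of one-hot vectors: -1 at the source base, +1 at the target."""
    return [d - s for s, d in zip(one_hot(src), one_hot(dst))]


def combine_children(left, right):
    """Element-wise sum: commutative, so swapping the two children changes nothing."""
    return [l + r for l, r in zip(left, right)]


if __name__ == "__main__":
    print(encode_mutation("A", "T"))
    print(encode_mutation("T", "A"))
    print(combine_children(encode_mutation("A", "T"), encode_mutation("C", "G")))
```

The sign flip between A→T and T→A is what preserves directionality, and an unchanged site encodes to the zero vector.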

Section 05

Technical Implementation Details

  • Environment Configuration: Dependencies are managed with conda/mamba. Create the environment from environment.yml with mamba env create -f environment.yml, then install the package with pip install -e .;
  • Data Format: Training data is stored in pickle format as a dictionary whose keys are tree objects and whose values are label lists (0=MP edge, 1=non-MP edge, sorted by pre-order traversal);
  • GPU Acceleration: TraversalDataset is built on torch.tensor, so it runs efficiently on GPUs and suits large-scale datasets.
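The data format described above can be mimicked with a small round trip through pickle. In this sketch Newick strings stand in for the real tree objects so that it needs no dependencies, and the label values are invented purely for illustration; the validate helper is hypothetical.

```python
import io
import pickle

# Stand-in data: Newick strings replace the real tree objects used as keys.
# Each label list has one entry per edge (0 = MP edge, 1 = non-MP edge).
toy_data = {
    "((A,B),(C,D));": [0, 1, 0, 0, 1, 1],
    "((A,C),(B,D));": [1, 1, 0, 1, 0, 0],
}


def validate(data):
    """Check that every value is a list containing only the labels 0 and 1."""
    for tree, labels in data.items():
        if not all(label in (0, 1) for label in labels):
            raise ValueError(f"non-binary label for tree {tree!r}")
    return data


# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(toy_data, buffer)
buffer.seek(0)
loaded = validate(pickle.load(buffer))

# Fraction of edges labeled as MP edges, useful for checking class balance.
total_edges = sum(len(labels) for labels in loaded.values())
mp_fraction = sum(label == 0 for labels in loaded.values() for label in labels) / total_edges
```

Since unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.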

Section 06

Application Value and Future Outlook

Potential applications of DPVT include:

  1. Accelerating phylogenetic inference: Prioritize searching for high-probability edges to reduce computation time;
  2. Guiding heuristic searches: Provide heuristic guidance for traditional methods like RAxML/IQ-TREE;
  3. Understanding evolutionary patterns: Neural network features may reveal hidden evolutionary laws;
  4. Expansion directions: Extend from maximum parsimony to more complex models like maximum likelihood or Bayesian inference.
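How predictions might feed a search is not described in the source. As one hypothetical reading of the first application, a search could visit the edges the model is most confident about first. The sketch below assumes scores follow the 0 = MP edge convention, so lower scores rank higher; the function name and threshold are illustrative.

```python
def rank_edges(scores, threshold=0.5):
    """Return ids of edges predicted as MP edges (score < threshold), most confident first."""
    kept = [(score, edge) for edge, score in scores.items() if score < threshold]
    return [edge for score, edge in sorted(kept)]


if __name__ == "__main__":
    # Hypothetical per-edge scores, e.g. taken from a trained classifier.
    print(rank_edges({"e1": 0.9, "e2": 0.1, "e3": 0.4}))
```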

Section 07

Summary

DPVT demonstrates one direction for integrating deep learning with traditional bioinformatics. Using tree traversal mechanisms and mutation encoding, TraverseNN predicts which edges are likely to appear in maximum parsimony trees, pairing potential computational savings with value for biological analysis. It offers a reference case for researchers in bioinformatics, computational biology, or graph neural networks.