# DPVT: A New Method for Phylogenetic Inference Based on Tree Traversal Neural Networks

> A PyTorch project that combines deep learning with phylogenetics, using neural networks to traverse tree structures and predict edges in maximum parsimony trees, providing innovative ideas for phylogenetic inference in bioinformatics.

- 板块: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- 发布时间: 2026-05-19T03:43:39.000Z
- 最近活动: 2026-05-19T03:53:18.534Z
- 热度: 154.8
- 关键词: 深度学习, 系统发育学, 生物信息学, PyTorch, 神经网络, 进化树, 最大简约法, 计算生物学, 图神经网络, Transformer
- 页面链接: https://www.zingnex.cn/en/forum/thread/dpvt
- Canonical: https://www.zingnex.cn/forum/thread/dpvt
- Markdown 来源: floors_fallback

---

## [Introduction] DPVT: An Innovative Method for Phylogenetic Inference Empowered by Deep Learning

DPVT (Deep Phylogenetics Via Traversals) is a PyTorch project that combines deep learning with phylogenetics. It uses the TraverseNN neural network to traverse tree structures and predict edges in maximum parsimony trees, aiming to solve the computational complexity problem of traditional maximum parsimony methods when handling large-scale data, and providing a new path for phylogenetic inference in bioinformatics.

## Research Background and Problem Definition

The core task of phylogenetic inference is to reconstruct species evolutionary trees from DNA sequences. The maximum parsimony method assumes that evolution follows the principle of simplicity, but the search space grows exponentially with the number of species (n species correspond to (2n-5)!! unrooted binary trees), making the traversal of candidate trees by traditional algorithms computationally expensive. DPVT raises a core question: Can we train a neural network to determine which edges are likely to appear in maximum parsimony trees, thereby reducing the search space?

## Technical Architecture: TraverseNN Model Design

DPVT implements the TraverseNN module, leveraging the hierarchical characteristics of tree structures:
1. **Data Representation**: Supports two dataset formats—TraversalDataset (serializes tree traversals into tensors, supports GPU acceleration, includes upward and downward traversal directions, and node features are learned via RNN) and TreeDataset (uses the ete3 library to preserve tree topology and attributes);
2. **Forward Propagation Process**: 
   - Traversal Learning: Information flows from leaf nodes to the root and back (similar to message passing);
   - Site Aggregation: Uses a Transformer encoder to aggregate cross-site information, and takes the average to get the final node features;
   - Classification Output: A linear layer plus Sigmoid outputs the probability of an edge being in the maximum parsimony tree (close to 0 means present, 1 means absent).

## Key Mechanisms and Training Strategies

- **Mutation Encoding**: DNA base mutations are encoded using four-dimensional vectors (e.g., A→T is [-1,1,0,0]) to preserve directionality;
- **Symmetry Handling**: The order of child nodes does not affect parent node features, ensuring consistent output when child nodes are swapped;
- **Training Strategy**: Training/validation split of 0.8/0.2, balanced ratio of positive and negative samples (MP edges vs non-MP edges) to ensure the effectiveness of the classification task.

## Technical Implementation Details

- **Environment Configuration**: Uses conda/mamba to manage dependencies, creates the environment via environment.yml, execute `mamba env create -f environment.yml` and `pip install -e .`;
- **Data Format**: Training data is in pickle format, with dictionary keys as tree objects and values as label lists (0=MP edge, 1=non-MP edge, sorted by pre-order traversal);
- **GPU Acceleration**: TraversalDataset is based on torch.tensor, supports efficient GPU operation, suitable for large-scale datasets.

## Application Value and Future Outlook

Potential applications of DPVT include:
1. Accelerating phylogenetic inference: Prioritize searching for high-probability edges to reduce computation time;
2. Guiding heuristic searches: Provide heuristic guidance for traditional methods like RAxML/IQ-TREE;
3. Understanding evolutionary patterns: Neural network features may reveal hidden evolutionary laws;
4. Expansion directions: Extend from maximum parsimony to more complex models like maximum likelihood or Bayesian inference.

## Summary

DPVT demonstrates an innovative direction for the integration of deep learning and traditional bioinformatics. Through tree traversal mechanisms and mutation encoding, TraverseNN can predict important edges in phylogenetic trees, combining computational efficiency advantages with biological analysis value, and providing a reference case for researchers in bioinformatics, computational biology, or graph neural networks.
