Zing Forum


DTI-LLM: A Large Language Model-Based Framework for Drug-Target Interaction Prediction

This article introduces DTI-LLM, an open-source project that leverages the reasoning capabilities of large language models (LLMs) to predict interactions between drugs and target proteins. By integrating multi-dimensional features such as protein-protein interaction scores, sequence similarity, and embedding similarity, along with three prompting strategies—direct prediction, chain-of-thought (CoT), and synthetic reasoning—this project provides an interpretable AI solution for the field of drug discovery.

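Tags: drug discovery, large language models, DTI prediction, bioinformatics, machine learning, LoRA fine-tuning, explainable AI, protein interactions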
Published 2026-06-12 15:37 | Recent activity 2026-06-12 15:50 | Estimated read 10 min

Section 01

Introduction: Core Overview of the DTI-LLM Open-Source Project

DTI-LLM is an open-source project developed by NimishaGhosh (GitHub: https://github.com/NimishaGhosh/DTI-LLM, released in June 2026). It uses the reasoning capabilities of large language models (LLMs) to predict drug-target interactions (DTI). The model is given multi-dimensional features (protein-protein interaction scores, sequence similarity, and embedding similarity) and is prompted with one of three strategies: direct prediction, chain-of-thought (CoT), or synthetic reasoning. The stated goal is an interpretable AI approach to drug discovery that helps accelerate candidate drug screening and reduce R&D costs.


Section 02

Project Background and Core Challenges

Drug discovery is slow and expensive: traditional methods can take years and cost billions of dollars. DTI prediction is a core task in computational drug discovery because it can narrow the set of candidate drugs before laboratory work begins. The task faces several challenges. The complex structures of drug molecules and proteins make traditional methods computationally costly. Biological systems are heterogeneous, so the same drug can bind very differently to different proteins. Labeled data is scarce and the classes are imbalanced, which limits model performance. Recent advances of LLMs in NLP suggest a new way to approach these problems, and DTI-LLM is one attempt to apply LLMs in the biomedical field.


Section 03

Technical Architecture: Multi-Dimensional Features and Prompting Strategies

The core innovations of DTI-LLM lie in feature engineering and prompting strategies:

  1. Multi-dimensional Feature Integration:
    • Protein-Protein Interaction (PPI) Score: Reflects how strongly the target protein interacts with known drug targets, following the "guilt-by-association" principle;
    • Sequence Similarity: Draws on prior knowledge of known drug-target pairs, on the assumption that proteins with similar sequences tend to have similar functions and structures;
    • Embedding Similarity: Uses embeddings generated by pre-trained biological language models to capture high-level semantic information, covering both protein-protein and drug-protein embedding similarity.
  2. Three Prompting Strategies:
    • Direct Prediction: Takes the features and directly outputs a binary result, which is efficient for rapid screening;
    • Chain-of-Thought (CoT): Has the model reason step by step so that the logical chain behind a prediction is visible, with the aim of improving interpretability and accuracy;
    • Synthetic Reasoning: Automatically generates natural language reasoning text and uses it as a supervision signal during training, so that the model learns to explain the basis of its predictions the way an expert would.
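
To make the three strategies more concrete, the following is a minimal sketch of how such prompts and training targets could be assembled. It is not taken from the DTI-LLM repository: the function names, the feature names (`ppi_score`, `sequence_similarity`), the prompt wording, and the 0.5 threshold used in the templated rationale are all assumptions made for illustration.

```python
from typing import Dict


def format_features(features: Dict[str, float]) -> str:
    """Render each numeric feature as one '- name: value' line."""
    return "\n".join(f"- {name}: {value:.3f}" for name, value in features.items())


def build_prompt(drug: str, target: str, features: Dict[str, float], style: str) -> str:
    """Build a prompt for one drug-target pair (hypothetical wording)."""
    header = (
        f"Drug: {drug}\n"
        f"Target protein: {target}\n"
        f"Features:\n{format_features(features)}\n"
    )
    if style == "direct":
        # Direct prediction: ask only for the binary answer.
        task = "Does the drug interact with the target? Answer with 'Yes' or 'No' only."
    elif style == "cot":
        # Chain-of-thought: ask for visible reasoning before the answer.
        task = (
            "Reason step by step about each feature, then give the final answer "
            "on the last line as 'Answer: Yes' or 'Answer: No'."
        )
    else:
        raise ValueError(f"unknown prompt style: {style!r}")
    return header + task


def synthetic_rationale(features: Dict[str, float], label: bool, threshold: float = 0.5) -> str:
    """Templated reasoning text usable as a training target.

    The threshold and the 'high'/'low' wording are assumptions, not the
    project's actual rationale generator.
    """
    lines = []
    for name, value in features.items():
        level = "high" if value >= threshold else "low"
        lines.append(f"The {name} is {value:.3f}, which is {level}.")
    lines.append(f"Answer: {'Yes' if label else 'No'}")
    return "\n".join(lines)


if __name__ == "__main__":
    feats = {"ppi_score": 0.82, "sequence_similarity": 0.31}
    print(build_prompt("aspirin", "PTGS2", feats, "cot"))
    print(synthetic_rationale(feats, label=True))
```

In this sketch the direct and CoT styles differ only in the instruction appended to the same feature header, while synthetic reasoning changes the training target rather than the prompt.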

Section 04

Model Implementation Details

Three aspects of the implementation stand out:

  • Quantization and LoRA Fine-tuning: Uses 4-bit quantization to reduce memory usage and applies parameter-efficient fine-tuning via LoRA. Only the low-rank adapter parameters are trained, which makes it feasible to fine-tune LLMs on consumer-grade GPUs;
  • Multi-Model Support: The code architecture is compatible with mainstream open-source LLMs such as Qwen, Mistral, and LLaMA series, with a unified configuration interface for easy model switching;
  • Flexible Feature Modes: Provides six feature modes (all, ppi_only, seq_only, no_emb, no_ppi, no_seq) to support ablation experiments evaluating the contribution of different feature subsets.
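
Two of the points above can be illustrated with a short sketch. The first function shows the standard LoRA arithmetic: an adapter for a weight matrix of shape d_out × d_in with rank r trains r × (d_out + d_in) parameters. The second part shows one plausible reading of the six feature modes. The mapping from mode names to feature groups is inferred from the names alone, and the column names are hypothetical, so neither should be read as the project's actual configuration.

```python
from typing import Dict, List, Tuple


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter.

    LoRA learns an update B @ A with B of shape (d_out, rank) and
    A of shape (rank, d_in), while the original weights stay frozen.
    """
    return rank * (d_out + d_in)


# Assumed meaning of each mode, inferred from the mode names only.
FEATURE_MODES: Dict[str, Tuple[str, ...]] = {
    "all": ("ppi", "seq", "emb"),
    "ppi_only": ("ppi",),
    "seq_only": ("seq",),
    "no_emb": ("ppi", "seq"),
    "no_ppi": ("seq", "emb"),
    "no_seq": ("ppi", "emb"),
}

# Hypothetical column names for each feature group.
GROUP_COLUMNS: Dict[str, List[str]] = {
    "ppi": ["ppi_score"],
    "seq": ["seq_similarity"],
    "emb": ["emb_similarity"],
}


def columns_for_mode(mode: str) -> List[str]:
    """Return the feature columns that a given mode would keep."""
    if mode not in FEATURE_MODES:
        raise ValueError(f"unknown feature mode: {mode!r}")
    return [col for group in FEATURE_MODES[mode] for col in GROUP_COLUMNS[group]]


if __name__ == "__main__":
    full = 4096 * 4096
    adapter = lora_trainable_params(4096, 4096, rank=8)
    print(f"adapter params: {adapter} ({adapter / full:.2%} of the full matrix)")
    for mode in FEATURE_MODES:
        print(mode, columns_for_mode(mode))
```

For a 4096 × 4096 matrix and rank 8, the adapter has 65,536 trainable parameters against 16,777,216 in the full matrix, or about 0.39%. That is why LoRA combined with 4-bit quantization of the frozen weights can fit on a consumer-grade GPU.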

Section 05

Practical Application Workflow

Using DTI-LLM involves three steps:

  1. Data Preparation: Prepare Parquet files (train_with_emb.parquet and test_with_emb.parquet) containing precomputed features such as PPI scores, sequence similarity, and embedding similarity;
  2. Parameter Configuration: Specify the base model path, output directory, prompt style, and feature mode on the command line. Runs with multiple random seeds are supported, so results can be checked for robustness;
  3. Evaluation: Provides evaluate.py and evaluate_SR.py scripts to calculate metrics like accuracy, precision, recall, and F1 score, and supports synthetic reasoning quality assessment.
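
The metrics named in the evaluation step, and the idea of aggregating them over several random seeds, can be sketched as follows. This is an independent illustration using only the Python standard library; it does not reproduce evaluate.py or evaluate_SR.py, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = interaction)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def summarize_over_seeds(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of each metric across seed runs."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0]
    runs = [
        binary_metrics(y_true, [1, 0, 0, 1, 1, 0]),  # predictions from one seed
        binary_metrics(y_true, [1, 0, 1, 1, 1, 0]),  # predictions from another seed
    ]
    for name, (avg, spread) in summarize_over_seeds(runs).items():
        print(f"{name}: {avg:.3f} +/- {spread:.3f}")
```

Because DTI datasets are typically imbalanced, precision, recall, and F1 are more informative than accuracy alone, and reporting the spread across seeds shows how sensitive a result is to initialization.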

Section 06

Technical Significance and Potential Impact

DTI-LLM's potential contributions fall into four areas:

  • Improved Interpretability: Provides natural language explanations through chain-of-thought and synthetic reasoning, helping researchers understand the basis of predictions;
  • Data Efficiency: Combines pre-trained LLM knowledge with feature engineering, with the aim of performing well even when labeled data is limited;
  • Modularity and Extensibility: Clear code structure, making it easy to replace feature extractors, try new prompting strategies, or integrate additional biological data sources;
  • Open-Source Ecosystem Contribution: As an open-source project, it offers reusable code that others can reproduce and benchmark against, supporting collective progress in the field.

Section 07

Limitations and Future Directions

The project has clear limitations at present, along with several natural directions for future work:

  • Limitations: The README document is brief, lacking detailed performance benchmarks, dataset descriptions, and pre-trained model download links; only code implementation is provided, with no pre-trained weights or large-scale experimental results, creating a barrier for researchers with limited computational resources;
  • Future Directions: Integrate more biological features such as 3D structural information and gene expression data; explore larger-scale base models; develop interactive visualization tools to display the reasoning process; closely integrate with experimental validation workflows.

Section 08

Conclusion: Value and Outlook of DTI-LLM

DTI-LLM is an open-source project that brings LLM reasoning capabilities to DTI prediction. By combining multi-dimensional feature integration with interpretable prompting strategies, it offers a promising technical route for computational drug discovery. As the project matures and the community gets involved, it could help accelerate new drug discovery and reduce R&D costs. Researchers and developers interested in AI-driven drug discovery may find it worth following.