Protein Large Language Models Facilitate Cross-Species Single-Cell Transcriptome Integration

This project uses protein large language models (ESM2) to achieve cross-species gene homology mapping, providing a complete workflow with five different strategies for cross-species integration of single-cell transcriptome data.

protein language model, ESM2, single-cell, transcriptomics, cross-species, gene homologue, bioinformatics

Published 2026-05-27 22:14 | Recent activity 2026-05-27 22:21 | Estimated read 7 min

Section 01

Introduction: Protein Large Language Models Facilitate Cross-Species Single-Cell Transcriptome Integration

Core Overview

This project was developed by KKzhongyi and released on GitHub as pLLM-cross-species-integration (https://github.com/KKzhongyi/pLLM-cross-species-integration, release date: 2026-05-27). It uses the protein large language model ESM2 to map homologous genes across species and provides a complete workflow with five strategies, addressing the gene naming differences that complicate cross-species integration of single-cell transcriptome data.

Section 02

Research Background and Challenges

Single-cell transcriptome sequencing has developed rapidly, but differences in gene naming between species remain a major obstacle to cross-species comparison. Traditional homology mapping relies on orthology information from databases such as Ensembl, which has two limitations. First, it is based on sequence similarity and does not capture functional similarity at the protein level. Second, one-to-many and many-to-many mappings are common in these databases, whereas single-cell analysis requires a one-to-one mapping in which each gene in one species corresponds to exactly one gene in the other.
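To make the one-to-one requirement concrete, the following minimal sketch filters a many-to-many ortholog table down to the pairs in which each gene appears exactly once on both sides. The gene identifiers (H1, M1, and so on) and the function name are hypothetical placeholders, not taken from the project or from Ensembl.

```python
from collections import Counter

def filter_one_to_one(pairs):
    """Keep only pairs whose genes each occur exactly once in the table."""
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return [(a, b) for a, b in pairs if count_a[a] == 1 and count_b[b] == 1]

# Hypothetical ortholog table (species A gene, species B gene).
ortholog_pairs = [
    ("H1", "M1"),  # one-to-one
    ("H2", "M2"),  # H2 maps to two genes: one-to-many
    ("H2", "M3"),
    ("H3", "M4"),  # M4 is hit by two genes: many-to-one
    ("H4", "M4"),
]

one_to_one = filter_one_to_one(ortholog_pairs)
print(one_to_one)
```

In this toy table only one of five pairs survives, which illustrates how strict filtering can discard potentially useful homologous pairs.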

Section 03

Role of the Protein Language Model ESM2

Protein language models (pLMs) capture structural and functional information from protein sequences through unsupervised learning. ESM2, developed by Meta AI, is available in sizes ranging from 8 million to 15 billion parameters. It treats a protein sequence as a sentence and each amino acid as a word, learning structural and functional patterns through masked language modeling, that is, by predicting amino acids that have been hidden from the input. The resulting embedding vectors reflect structural and functional relationships between proteins.
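As a rough illustration of how such embeddings can be compared, the sketch below averages per-residue vectors into a single protein-level vector and scores two proteins by cosine similarity. Both the mean pooling over residues and the choice of cosine similarity are assumptions made for illustration; the toy 3-dimensional vectors stand in for real ESM2 outputs, which have hundreds to thousands of dimensions.

```python
import math

def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy per-residue "embeddings" (made-up numbers, not model output).
protein_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]                    # 2 residues
protein_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]   # 3 residues

vec_a = mean_pool(protein_a)
vec_b = mean_pool(protein_b)
similarity = cosine_similarity(vec_a, vec_b)
print(vec_a, vec_b, round(similarity, 6))
```

Pooling makes proteins of different lengths comparable, because every protein ends up as a vector of the same dimension regardless of how many residues it has.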

Section 04

Comparison of Five Homology Mapping Strategies

Five Homology Mapping Strategies

The project implements 5 strategies:

ENS_M2M: Downloads many-to-many mappings directly from Ensembl; the information is complete but cannot be used directly for single-cell analysis;
ENS_O2O: Filters one-to-one orthologous relationships from Ensembl; simple and effective but loses potential homologous pairs;
HM_O2O: Applies greedy selection based on Ensembl attributes (sequence identity, confidence) to obtain conflict-free one-to-one mappings;
LM_O2O: The core innovation. Uses ESM2_150B to generate embedding vectors and derives one-to-one mappings via bidirectional best hit (DBH) plus a greedy algorithm; genes with multiple isoforms are handled by average pooling, max pooling, or selection of the canonical isoform;
HL_O2O: Combines the results of HM_O2O and LM_O2O through weighted integration of multi-dimensional scores.
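The combination of bidirectional best hits and greedy selection described for LM_O2O can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the similarity scores are made up, the gene names are placeholders, and the `min_score` cutoff is a hypothetical parameter added so that weak leftover pairs are not forced into the mapping.

```python
def reciprocal_best_hits(sim):
    """sim maps (gene_a, gene_b) -> score. Return pairs that are mutual best hits."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in sim.items():
        if a not in best_for_a or score > sim[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > sim[(best_for_b[b], b)]:
            best_for_b[b] = a
    return [(a, b) for a, b in best_for_a.items() if best_for_b[b] == a]

def greedy_one_to_one(sim, seed_pairs=(), min_score=0.5):
    """Extend seed_pairs with the highest-scoring pairs whose genes are both unused."""
    mapping = list(seed_pairs)
    used_a = {a for a, _ in mapping}
    used_b = {b for _, b in mapping}
    for (a, b), score in sorted(sim.items(), key=lambda kv: kv[1], reverse=True):
        if score < min_score:
            break  # remaining pairs score even lower
        if a not in used_a and b not in used_b:
            mapping.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return mapping

# Made-up similarity scores between species A genes (H*) and species B genes (M*).
similarity = {
    ("H1", "M1"): 0.99,
    ("H2", "M2"): 0.85,
    ("H3", "M2"): 0.80,
    ("H3", "M3"): 0.75,
    ("H2", "M3"): 0.60,
    ("H1", "M3"): 0.20,
}

mutual = reciprocal_best_hits(similarity)
mapping = greedy_one_to_one(similarity, seed_pairs=mutual)
print(mutual)
print(mapping)
```

Here H3's best hit is M2, but M2 prefers H2, so H3 is not part of a mutual best hit. The greedy pass then assigns H3 to its next-best free partner, M3, which keeps the final mapping conflict-free.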

Section 05

End-to-End Workflow and Technical Details

The workflow has three steps: generating dataset-level homology tables; performing cross-species integration (UMAP dimensionality reduction and CCA batch correction); and evaluating integration performance with clustering metrics such as average silhouette width (ASW) and the degree of cell type mixing. Tech stack: the ESM2_150B model generates the embeddings, results are stored in HDF5 format (compatible with Scanpy/Seurat), and the data are uploaded to Zenodo to support reproducibility.
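For readers unfamiliar with ASW, the sketch below computes it from scratch on a toy 2-D embedding. For each point, a is the mean distance to the other points with the same label and b is the lowest mean distance to the points of any other label; the silhouette is (b - a) / max(a, b), and ASW is its average over all points. The coordinates and labels are invented, and which labels the project scores (cell type, species, or both) is not stated in the source.

```python
import math
from collections import defaultdict

def average_silhouette_width(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two distinct labels")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Toy embedding: two tight groups far apart.
cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

separated = average_silhouette_width(cells, ["alpha", "alpha", "beta", "beta"])
interleaved = average_silhouette_width(cells, ["alpha", "beta", "alpha", "beta"])
print(round(separated, 3), round(interleaved, 3))
```

When the labels match the spatial groups the score is close to 1, and when the labels are interleaved across the groups it turns negative. Whether a high or a low value is desirable depends on what is being labeled: high for cell types that should stay distinct, low for species or batch labels that should mix.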

Section 06

Application Case: Cross-Species Analysis of Islet Cells

Taking pancreatic islet cells (key cell types in diabetes research) as an example, the workflow integrates single-cell data from humans, mice, and pigs to identify conserved cell type marker genes and discover species-specific expression patterns, providing support for translational medicine.

Section 07

Research Significance and Future Outlook

This project demonstrates the practical value of pLMs in computational biology, going beyond the limitations of traditional sequence alignment. As a guide to strategy selection: choose ENS_M2M for maximum coverage; choose HM_O2O or HL_O2O for high-confidence conserved relationships; choose LM_O2O for genes that are functionally similar but highly divergent in sequence. Looking ahead, combining pLMs with single-cell technology is likely to become an important trend in bioinformatics.
