Omics Data Revolution in the Precision Medicine Era
With the development of high-throughput sequencing technology, biomedicine has entered the era of omics big data. Data from multiple omics layers offer new angles for understanding disease mechanisms and predicting disease risk. This project focuses on transcriptomic RNA-seq data to explore the association between gene expression and disease states.
RNA-seq Technical Principles and Data Characteristics
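RNA-seq quantifies gene expression by sequencing RNA at high throughput and counting the reads assigned to each gene. Compared to microarrays, it is more sensitive and has a wider dynamic range. The resulting data have four characteristics that shape the analysis: high dimensionality with sparsity (tens of thousands of genes are measured, but only a subset is expressed in any given sample, so many counts are zero); batch effects (technical variation between runs that must be corrected or modeled); overdispersed counts, which are commonly modeled with a negative binomial distribution and therefore call for count-specific statistical methods; and a high-dimensional, small-sample setting (far more features than samples).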
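To make these characteristics concrete, the following is a minimal sketch of one common first preprocessing step: scaling raw counts to counts per million (CPM), dropping genes that are barely expressed, and log-transforming the rest. The function name filter_and_log_cpm, the thresholds (min_cpm, min_samples), the pseudocount, and the toy count matrix are all illustrative assumptions, not part of this project's pipeline. The sketch does not address batch effects or negative binomial modeling, which usually rely on dedicated tools.

```python
import numpy as np

def filter_and_log_cpm(counts, min_cpm=1.0, min_samples=2, pseudocount=1.0):
    """Filter lowly expressed genes and return log2(CPM + pseudocount).

    counts: genes x samples array of raw read counts.
    Assumes every sample has a non-zero total count.
    The thresholds are illustrative defaults, not recommendations.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)              # total reads per sample
    cpm = counts / lib_sizes * 1e6              # counts per million
    # Keep genes that reach min_cpm in at least min_samples samples.
    keep = (cpm >= min_cpm).sum(axis=1) >= min_samples
    log_cpm = np.log2(cpm[keep] + pseudocount)
    return log_cpm, keep

# Toy matrix (hypothetical data): 4 genes x 4 samples.
counts = np.array([
    [0,   0,  0,   0],    # never expressed
    [10,  12, 8,   15],   # stable, low expression
    [500, 30, 900, 45],   # highly variable expression
    [0,   1,  0,   0],    # detected in a single sample only
])

log_cpm, keep = filter_and_log_cpm(counts)

# Per-gene mean and variance of the raw counts for the retained genes.
# For the highly variable gene the variance far exceeds the mean,
# which is the overdispersion that motivates negative binomial models.
gene_mean = counts[keep].mean(axis=1)
gene_var = counts[keep].var(axis=1, ddof=1)
```

On this toy input, the all-zero gene and the gene detected in only one sample are removed, which shrinks the feature space before any modeling. Real datasets would normally use a library-size normalization and filtering rule chosen for the study design.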
TCGA Database and Breast Cancer Dataset
TCGA (The Cancer Genome Atlas) is an important public resource for cancer research, containing multi-omics data for 33 cancer types. The breast cancer cohort (BRCA) has the largest sample size and includes gene expression, clinical phenotypes, genomic variants, and DNA methylation data, providing rich features for prediction models.