Abstract
Background: Nuclear genomic DNA plays a crucial role in individual development and phenotype determination. The genetic landscape within populations exhibits significant heterogeneity, contributing to diverse human traits. Current studies of human genome heterogeneity often focus on specific segments of high-frequency phenotype-associated sequences or structurally complex regions. Therefore, to overcome the limitations of previous studies and more directly explore population heterogeneity, it is essential to study the entire genome rather than focusing only on known phenotype-associated regions.
Results: Using set theory, we have clearly defined Complex Regions (Complex_Region) by integrating pan-genome datasets, covering about 8.1% of the human genome. These regions exhibit high sequence diversity and nonrandom long continuous fragments (≥450kb), thus reflecting population genetic complexity. Our enrichment analysis revealed that genes within Complex_Region are primarily involved in immunity and metabolism, indicating chromosome-specific functional enrichment. Notably, immune genes are mainly located on chromosomes 6 and 19, which are closely associated with disease occurrence. Moreover, these regions are enriched for human phenotype-related signals and tumor somatic mutations, providing novel insights for large-scale cohort studies. We also detected ancient viral sequences, particularly ~9.47 kb human endogenous retroviruses (HERV) insertion sequence NC_022518, which is diverse in humans but remains conserved across primates, to be implicated in regulating bodily functions and various diseases.
Conclusions: Our study highlights the biomedical importance of Complex_Region by revealing associations among genotypes, environment, and phenotypes. This enhances our understanding of life regulation and phenotype shaping, highlighting the role of these regions in immunity, metabolism, and disease association.