An Integrative Bioinformatics Pipeline for NHANES Data Processing for Machine Learning Analysis of Oral Health Outcomes

Author:

Orlenko Alena¹, Mure Justin D², Gluch Joan I², Gregg John³, Compher Charlene W⁴, Koo Hyun², Moore Jason H¹

Affiliation:

1. Cedars-Sinai Medical Center

2. University of Pennsylvania School of Dental Medicine

3. University of Pennsylvania

4. University of Pennsylvania, Hospital of the University of Pennsylvania

Abstract

Large database sources such as the National Health and Nutrition Examination Survey (NHANES) are of great utility for epidemiological studies, but they pose challenges for machine learning: data heterogeneity, varied sample sizes, missing values and outliers, and variations in data collection and interpretation all require thorough data-quality assessment and cleaning. In addition, complex disease outcomes often display a high degree of clinical heterogeneity, necessitating deeper phenotypic subtyping. Here, we develop an integrated data-cleaning and subtype-discovery pipeline that uses unsupervised learning algorithms for comprehensive analysis and for network-based and clustering visualization of data patterns and outcomes. We apply this pipeline to NHANES, one of the largest curated repositories of population-level health-related indicators, which includes physical examination, blood biochemistry, self-reported survey, and dietary intake data. We focus our investigations on dental caries, which remains the most prevalent chronic disease, affecting more than 3.5 billion people worldwide. Our multidimensional pipeline declutters and optimizes the NHANES data, including redundant variable types, to streamline data integration and create a ‘machine learning-ready’ version of the report. In addition, this approach reveals data patterns that led to the discovery of previously unrecognized subtypes and variables associated with the clinical phenotype heterogeneity of dental caries. We observed diverging patterns of similarity within different age groups and different variable subsets, and derived unexpected associations of sleep deprivation and specific laboratory markers with the disease. Altogether, we report a comprehensive data processing approach that can guide the development of more precise and robust machine learning predictive models for dental caries and other health conditions from NHANES.
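
As a rough illustration of the kind of cleaning steps the abstract mentions (handling missing values and removing redundant variables), the following minimal sketch may help. It is not the authors' pipeline: the function names, the variable names (`age`, `sugar_g`, `sugar_mg`, `sleep_h`), the toy values, and the thresholds (50% missingness, absolute correlation of 0.95) are all assumptions chosen for illustration and do not come from NHANES or from the paper.

```python
from math import sqrt
from statistics import mean, median


def drop_sparse_columns(rows, max_missing=0.5):
    """Drop variables whose fraction of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


def impute_median(rows):
    """Replace remaining missing values with the per-variable median."""
    medians = {c: median(r[c] for r in rows if r[c] is not None)
               for c in rows[0]}
    return [{c: medians[c] if r[c] is None else r[c] for c in r} for r in rows]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def drop_redundant_columns(rows, threshold=0.95):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for c in rows[0]:
        values = [r[c] for r in rows]
        if all(abs(pearson(values, [r[k] for r in rows])) < threshold
               for k in kept):
            kept.append(c)
    return [{c: r[c] for c in kept} for r in rows]


def clean(rows, max_missing=0.5, threshold=0.95):
    """Chain the three steps: sparse-variable removal, imputation, de-duplication."""
    rows = drop_sparse_columns(rows, max_missing)
    rows = impute_median(rows)
    return drop_redundant_columns(rows, threshold)


# Hypothetical toy records; sugar_mg duplicates sugar_g in different units.
toy = [
    {"age": 10, "sugar_g": 50.0, "sugar_mg": 50000.0, "sleep_h": None},
    {"age": None, "sugar_g": 80.0, "sugar_mg": 80000.0, "sleep_h": None},
    {"age": 15, "sugar_g": 30.0, "sugar_mg": 30000.0, "sleep_h": 7.0},
    {"age": 11, "sugar_g": 80.0, "sugar_mg": 80000.0, "sleep_h": None},
]
cleaned = clean(toy)
print(list(cleaned[0].keys()))
```

In this toy run, `sleep_h` is dropped for being mostly missing, the missing `age` is filled with the median, and `sugar_mg` is dropped as a rescaled copy of `sugar_g`. A real analysis would choose thresholds and imputation strategies per variable type, and would follow the cleaning with the unsupervised clustering step the abstract describes.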

Publisher

Research Square Platform LLC
