Statistical Analysis of Online Public Survey Lifestyle Datasets: A Machine Learning and Semantic Approach

Authors:

Chatterjee Ayan 1, Riegler Michael A. 1, Johnson Miriam Sinkerud 2, Das Jishnu 3, Pahari Nibedita 4, Ramachandra Raghavendra 5, Ghosh Bikramaditya 6, Saha Arpan 7, Bajpai Ram 8

Affiliations:

1. Simula Metropolitan Center for Digital Engineering

2. Oslo Metropolitan University

3. University of Agder

4. Tietoevry As

5. Norwegian University of Science and Technology

6. Symbiosis International University

7. Bharat Pharmaceutical Technology

8. Keele University

Abstract

Lifestyle diseases are the leading cause of the global health-related burden. A wide range of lifestyle factors has been shown to affect the pathogenesis of depression, and the COVID-19 pandemic has created an environment in which many determinants of depression are exacerbated. This study aims to identify lifestyle and demographic factors associated with symptoms of depression among Indians during the COVID-19 pandemic. To this end, we conducted an online public survey of voluntary participants in Kolkata, India, to collect data for statistical analysis, feature selection, and supervised and unsupervised machine learning. Moreover, we designed an ontology model to represent the knowledge obtained from the survey dataset semantically. We held an online workshop with researchers, professionals, and a group of participants to prepare the roadmap and a set of online questionnaires for surveying random participants from Kolkata who met the inclusion and exclusion criteria. We used social media platforms (e.g., Facebook, WhatsApp, and LinkedIn) and e-mail to distribute the questionnaire (a Google multiple-choice form) with forty-four questions. The survey data were collected anonymously and did not contain any personally identifiable information. The survey ran for three months (June 2021 to August 2021), and participation was voluntary. We used Python-based statistical and data visualization tools to clean and analyze the collected survey dataset. The questionnaire was easy to comprehend and answer; according to the survey, it took on average 15–18 minutes to complete.
We included defined population groups in this survey: adults aged ≥ 18 and < 65 years; both male and female; digitally literate; able to understand English; with Internet connectivity; infected or not infected with COVID-19; and with sufficient willingness and motivation, among other criteria. The survey yielded data from 1,834 participants. After removing missing data and outliers, we retained 1,767 participants for further analysis. Feature selection methods, such as Principal Component Analysis (PCA), analysis of variance (ANOVA), correlation analysis, SelectKBest, and ExtraTreeClassifier, were used to rank and select potentially important features from the dataset. Using K-means, we divided the min-max scaled dataset into five clusters with a Silhouette score of 0.12, cross-verified with the Elbow method. A Support Vector Machine classifier (SVC) with a linear kernel produced the highest accuracy of 96% (F1 = 96%, precision = 95%, recall = 96%, MCC = 94%) with 31 features using a PCA pipeline in a multi-class classification problem. The OWL ontology supported semantic representation of, and reasoning over, the knowledge gained from the survey dataset. This study demonstrates a pipeline to collect, analyze, and semantically represent datasets from an online public survey of random participants during the COVID-19 pandemic. Moreover, we correlated factors identified from the collected dataset with depressive health. An online public survey of this kind has both merits (e.g., easy data collection, easy data visualization, minimal cost, flexibility, absence of bias, identity preservation, and accessibility) and challenges (e.g., participant willingness, language barriers, difficulty in reaching the targeted population, digital literacy, dishonest responses, and sampling error).
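
The classification metrics named above (accuracy, precision, recall, F1, and MCC) can all be derived from a list of true labels and a list of predicted labels. The following is a minimal sketch in plain Python, not the authors' code. The function name `multiclass_metrics`, the macro averaging of precision, recall, and F1, and the toy labels are assumptions made for illustration, since the abstract does not state how per-class scores were averaged. MCC uses the standard multi-class generalization computed from per-class counts.

```python
from collections import Counter
from math import sqrt


def multiclass_metrics(y_true, y_pred):
    """Return accuracy, macro precision/recall/F1, and multi-class MCC.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    true_counts = Counter(y_true)  # how often each class truly occurs
    pred_counts = Counter(y_pred)  # how often each class is predicted
    # Per-class true positives: samples where prediction matches the truth.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hits.values())

    precisions, recalls, f1s = [], [], []
    for k in labels:
        prec = hits[k] / pred_counts[k] if pred_counts[k] else 0.0
        rec = hits[k] / true_counts[k] if true_counts[k] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Multi-class MCC from counts: (c*n - sum p_k t_k) divided by
    # sqrt((n^2 - sum p_k^2) * (n^2 - sum t_k^2)).
    numerator = correct * n - sum(pred_counts[k] * true_counts[k] for k in labels)
    denominator = sqrt(
        (n * n - sum(pred_counts[k] ** 2 for k in labels))
        * (n * n - sum(true_counts[k] ** 2 for k in labels))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": correct / n,
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels for three classes (illustrative only, not survey data).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    for name, value in multiclass_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC takes into account how predictions are distributed across all classes, which is why it is often reported alongside accuracy for multi-class problems with uneven class sizes.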

Publisher

Research Square Platform LLC

