Statistical Analysis of Online Public Survey Lifestyle Datasets: A Machine Learning and Semantic Approach

Published: 2023-11-28

Author:

Chatterjee Ayan¹, Riegler Michael A.¹, Johnson Miriam Sinkerud², Das Jishnu³, Pahari Nibedita⁴, Ramachandra Raghavendra⁵, Ghosh Bikramaditya⁶, Saha Arpan⁷, Bajpai Ram⁸

Affiliation:

1. Simula Metropolitan Center for Digital Engineering

2. Oslo Metropolitan University

3. University of Agder

4. Tietoevry As

5. Norwegian University of Science and Technology

6. Symbiosis International University

7. Bharat Pharmaceutical Technology

8. Keele University

Abstract

Lifestyle diseases are the leading cause of the global health-related burden, and a wide range of lifestyle factors has been shown to affect the pathogenesis of depression. The COVID-19 pandemic created an environment in which many determinants of depression were exacerbated. This study aims to identify lifestyle and demographic factors associated with symptoms of depression among Indians during the COVID-19 pandemic.

We conducted an online workshop with researchers, professionals, and a group of participants to prepare the roadmap and a set of online questionnaires, and then ran an online public survey of random voluntary participants from Kolkata, India, following the inclusion and exclusion criteria. The questionnaire set (a Google multiple-choice form with forty-four questions) was distributed through social media platforms (e.g., Facebook, WhatsApp, and LinkedIn) and e-mail. The survey data were collected anonymously and did not contain any personally identifiable information. The survey ran for three months (June 2021 to August 2021), and participation was voluntary. The questionnaire was easy to comprehend and answer; according to the survey, it took 15–18 minutes on average to complete. We included defined population groups: aged ≥ 18 and < 65 years; both male and female; digitally literate; able to understand English; with Internet connectivity; infected or not infected with COVID-19; and meeting further criteria such as willingness and motivation level. We used Python-based statistical and data visualization tools to clean and analyze the collected dataset, applying statistical analysis, feature selection, and supervised and unsupervised machine learning. Furthermore, we designed an OWL ontology to represent the knowledge obtained from the survey dataset semantically.

The survey yielded data from 1,834 participants. After removing missing data and outliers, we retained 1,767 participants for further analysis. Feature selection methods, such as Principal Component Analysis (PCA), analysis of variance (ANOVA), correlation analysis, SelectKBest, and ExtraTreeClassifier, were used to rank and select potentially important features from the dataset. Using K-means, we divided the min-max scaled dataset into five clusters with a Silhouette score of 0.12, cross-verified with the Elbow method. A Support Vector Machine classifier (SVC) with a linear kernel produced the highest accuracy of 96% (F1 = 96%, precision = 95%, recall = 96%, MCC = 94%) with 31 features using a PCA pipeline in a multi-class classification problem. The OWL ontology supported semantic representation of, and reasoning over, the knowledge gained from the survey dataset.

This study demonstrates a pipeline to collect, analyze, and semantically represent datasets from an online public survey of random participants during the COVID-19 pandemic. Moreover, we correlated factors identified from the collected dataset with depressive health. However, this online public survey has its own merits (e.g., easy data collection, easy data visualization, minimal cost, flexibility, non-bias, identity preservation, and accessibility) and challenges (e.g., willingness, language problems, difficulty in reaching the targeted population, digital literacy, dishonest responses, and sampling error).
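The clustering and classification steps named in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' code: the data are synthetic (generated with make_classification), the three-level target is hypothetical, and reading "31 features using a PCA pipeline" as 31 principal components is an assumption. Only the sample size (1,767), the question count (44), the number of clusters (5), min-max scaling, and the linear kernel are taken from the abstract.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             silhouette_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned survey table (NOT the study's data).
# 1,767 rows and 44 columns mirror the abstract; the 3-class target is a
# hypothetical placeholder for the depression-related label.
X, y = make_classification(n_samples=1767, n_features=44, n_informative=12,
                           n_classes=3, random_state=0)

# --- Unsupervised part: min-max scaling, then K-means ---
X_scaled = MinMaxScaler().fit_transform(X)

# Elbow method: record the within-cluster sum of squares for several k
# and look for the point where the decrease flattens.
inertias = {}
for k in range(2, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10,
                         random_state=0).fit(X_scaled).inertia_

# Five clusters, as reported in the abstract.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)
silhouette = silhouette_score(X_scaled, kmeans.labels_)

# --- Supervised part: scaling -> PCA -> linear-kernel SVC ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# n_components=31 is an assumed interpretation of "31 features".
clf = make_pipeline(MinMaxScaler(), PCA(n_components=31), SVC(kernel="linear"))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    "mcc": matthews_corrcoef(y_test, y_pred),
}

print("inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
print("silhouette (k=5):", round(silhouette, 3))
print("classification metrics:", metrics)
```

Placing the scaler and PCA inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model. The scores printed for the synthetic data are not comparable to the figures reported in the abstract.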

Publisher

Research Square Platform LLC
