Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study-Reference-Cited by-同舟云学术

Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study

Published:2021-10-26 Issue:10 Volume:9 Page:e25110
ISSN:2291-9694
Container-title:JMIR Medical Informatics
language:en
Short-container-title:JMIR Med Inform

Author:

Xie Wenxiu^ORCID,Ji Christine^ORCID,Hao Tianyong^ORCID,Chow Chi-Yin^ORCID

Abstract

Background There is an increasing body of research on the development of machine learning algorithms in the evaluation of online health educational resources for specific readerships. Machine learning algorithms are known for their lack of interpretability compared with statistics. Given their high predictive precision, improving the interpretability of these algorithms can help increase their applicability and replicability in health educational research and applied linguistics, as well as in the development and review of new health education resources for effective and accessible health education. Objective Our study aimed to develop a linguistically enriched machine learning model to predict binary outcomes of online English health educational resources in terms of their easiness and complexity for international tertiary students. Methods Logistic regression emerged as the best performing algorithm compared with support vector machine (SVM) (linear), SVM (radial basis function), random forest, and extreme gradient boosting on the transformed data set using L2 normalization. We applied recursive feature elimination with SVM to perform automatic feature selection. The automatically selected features (n=67) were then further streamlined through expert review. The finalized feature set of 22 semantic features achieved a similar area under the curve, sensitivity, specificity, and accuracy compared with the initial (n=115) and automatically selected feature sets (n=67). Logistic regression with the linguistically enhanced feature set (n=22) exhibited important stability and robustness on the training data of different sizes (20%, 40%, 60%, and 80%), and showed consistently high performance when compared with the other 4 algorithms (SVM [linear], SVM [radial basis function], random forest, and extreme gradient boosting). Results We identified semantic features (with positive regression coefficients) contributing to the prediction of easy-to-understand online health texts and semantic features (with negative regression coefficients) contributing to the prediction of hard-to-understand health materials for readers with nonnative English backgrounds. Language complexity was explained by lexical difficulty (rarity and medical terminology), verbs typical of medical discourse, and syntactic complexity. Language easiness of online health materials was associated with features such as common speech act verbs, personal pronouns, and familiar reasoning verbs. Successive permutation of features illustrated the interaction between these features and their impact on key performance indicators of the machine learning algorithms. Conclusions The new logistic regression model developed exhibited consistency, scalability, and, more importantly, interpretability based on existing health and linguistic research. It was found that low and high linguistic accessibilities of online health materials were explained by 2 sets of distinct semantic features. This revealed the inherent complexity of effective health communication beyond current readability analyses, which were limited to syntactic complexity and lexical difficulty.

Publisher

JMIR Publications Inc.

Subject

Health Information Management,Health Informatics

Reference40 articles.

1. Use of Machine Learning Algorithms to Predict the Understandability of Health Education Materials: Development and Evaluation Study

2. Written health education materials: Making them more effective

3. The quality and suitability of written educational materials for patients*

4. Issues in evaluating health websites in an Internet-based randomized controlled trial

5. Relationships among Educational Material Readability, Client Literacy, Perceived Beneficence, and Perceived Quality

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Advanced Machine Learning Models for Predicting Post-Thrombolysis Hemorrhagic Transformation in Acute Ischemic Stroke Patients: A Systematic Review and Meta-Analysis;Clinical and Applied Thrombosis/Hemostasis;2024-01

2. Multiple Automated Health Literacy Assessments of Written Health Information: Development of the SHeLL (Sydney Health Literacy Lab) Health Literacy Editor v1;JMIR Formative Research;2023-02-14