Machine learning in Huntington’s disease: exploring the Enroll-HD dataset for prognosis and driving capability prediction
Published:2023-07-27
Issue:1
Volume:18
ISSN:1750-1172
Container-title:Orphanet Journal of Rare Diseases
language:en
Short-container-title:Orphanet J Rare Dis
Author:
Ouwerkerk Jasper, Feleus Stephanie, van der Zwaan Kasper F., Li Yunlei, Roos Marco, van Roon-Mom Willeke M. C., de Bot Susanne T., Wolstencroft Katherine J., Mina Eleni
Abstract
Background
In biomedicine, machine learning (ML) has proven beneficial for the prognosis and diagnosis of different diseases, including cancer and neurodegenerative disorders. For rare diseases, however, the requirement for large datasets often prevents this approach. Huntington’s disease (HD) is a rare neurodegenerative disorder caused by a CAG repeat expansion in the coding region of the huntingtin gene. The world’s largest observational study for HD, Enroll-HD, describes over 21,000 participants. As such, Enroll-HD is amenable to ML methods. In this study, we pre-processed and imputed Enroll-HD with ML methods to maximise the inclusion of participants and variables. With this dataset we developed models to improve the prediction of the age at onset (AAO) and compared them with the well-established Langbehn formula. In addition, we used recurrent neural networks (RNNs) to demonstrate the utility of ML methods for longitudinal datasets, assessing driving capabilities by learning from previous participant assessments.
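The Langbehn formula used as the baseline is not reproduced in this abstract. As a minimal sketch, assuming the commonly cited parametric form from Langbehn et al. (2004), mean AAO = 21.54 + exp(9.556 − 0.146 × CAG), the baseline prediction could be computed as below. The function name is illustrative, and the constants are an assumption that should be checked against the original publication before any use.

```python
import math


def langbehn_mean_aao(cag_repeats: int) -> float:
    """Expected age at onset (years) for a given CAG repeat length.

    Assumption: constants follow the commonly cited form of the
    Langbehn et al. (2004) model; they are not taken from this abstract.
    Extrapolating far outside the CAG range the formula was fitted on
    is unreliable.
    """
    if cag_repeats <= 0:
        raise ValueError("CAG repeat length must be a positive integer")
    return 21.54 + math.exp(9.556 - 0.146 * cag_repeats)


if __name__ == "__main__":
    # Longer repeats give an earlier expected onset.
    for cag in (40, 45, 50):
        print(cag, round(langbehn_mean_aao(cag), 1))
```

The formula depends on CAG repeat length alone, which is why a model with access to further Enroll-HD variables has room to improve on it.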
Results
Simple pre-processing imputed around 42% of missing values in Enroll-HD. Imputing with ML allowed a further 167 variables to be retained. We found that multiple ML models were able to outperform the Langbehn formula. The best ML model (light gradient boosting machine) improved AAO prediction over the Langbehn formula by 9.2%, as measured by root mean squared error on the test set. In addition, our ML model provides a more accurate prognosis over a wider CAG repeat range than the Langbehn formula. Driving capability was predicted with an accuracy of 85.2%. The resulting pre-processing workflow and the code to train the ML models are available for related HD predictions at: https://github.com/JasperO98/hdml/tree/main.
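The abstract does not spell out how the 9.2% figure was derived. A plausible reading, assumed here, is the relative reduction in test-set RMSE of the ML model against the baseline formula. The sketch below shows that calculation on toy numbers that are purely illustrative and are not Enroll-HD data; the function names are hypothetical and not taken from the authors' repository.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse: float, model_rmse: float) -> float:
    """Percentage reduction in RMSE relative to the baseline (assumed metric)."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Toy ages at onset (years); illustrative values only.
observed_aao = [40.0, 50.0, 60.0]
baseline_pred = [44.0, 46.0, 64.0]  # stand-in for formula-based predictions
model_pred = [43.0, 47.0, 63.0]     # stand-in for ML model predictions

baseline_error = rmse(observed_aao, baseline_pred)
model_error = rmse(observed_aao, model_pred)
improvement = relative_improvement(baseline_error, model_error)

if __name__ == "__main__":
    print(f"baseline RMSE={baseline_error:.2f}, model RMSE={model_error:.2f}, "
          f"improvement={improvement:.1f}%")
```

Because RMSE is expressed in years, the relative figure is easier to compare across cohorts than the raw difference.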
Conclusions
Our pre-processing workflow made it possible to resolve the missing values and include most participants and variables in Enroll-HD. We show the added value of an ML approach, which improved AAO predictions and allowed for the development of an advisory model that can assist clinicians and participants in estimating future driving capability.
Publisher
Springer Science and Business Media LLC
Subject
Pharmacology (medical),Genetics (clinical),General Medicine