Abstract
Background
The random forest is a relatively recent machine-learning algorithm and is considered superior to other machine-learning and regression models in classification performance and accuracy. However, it has rarely been used to predict causes of death in lung cancer patients. Specific causes of death in these patients are also poorly classified or predicted, largely because the outcome is multicategory rather than binary (death vs survival).
Methods
We therefore tuned and applied a random forest algorithm (Stata, version 15) to classify and predict specific causes of death in lung cancer patients, using data from the Surveillance, Epidemiology, and End Results-18 (SEER-18) database and several clinicopathological factors. Lung cancers diagnosed in 2004 were included because their follow-up and cause-of-death data were complete. Patients were randomly divided into training and validation sets at a 1:1 ratio. We also compared the accuracy of the final random forest model with that of a multinomial regression model.
Results
We identified and randomly selected 40,000 lung cancer cases for analysis, with 20,000 cases in each set. The causes of death were, in descending order of frequency, lung cancer (72.45%), other causes or alive (14.38%), non-lung cancer (6.87%), cardiovascular disease (5.35%), and infection (0.95%). Models with more than 250 iterations and 10 variables produced the best predictions, with a best accuracy of 69.8% (error rate 30.2%). The final random forest model, with 300 iterations and 10 variables, was more accurate than the multinomial regression model (69.8% vs 64.6%). The top-10 most important factors in the random forest model were sex, chemotherapy status, age (65+ vs <65 years), radiotherapy status, nodal status, T category, histology type and laterality, which were also independently associated with the 5-category causes of death.
Conclusion
We optimized a random forest machine-learning model to predict the specific cause of death in lung cancer patients from a set of clinicopathologic factors. The model also appears to be more accurate than a multinomial regression model.
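The evaluation protocol described in the Methods (a random 1:1 split into training and validation sets, then accuracy on a multicategory outcome) can be illustrated with a minimal sketch. The study itself used Stata 15; the Python below is only an illustration, and the function names `split_half` and `accuracy`, the seed, and the five example labels are all hypothetical and not taken from the study data.

```python
import random


def split_half(cases, seed=0):
    """Shuffle a copy of `cases` and cut it into two halves (1:1 split)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def accuracy(observed, predicted):
    """Share of cases whose predicted category equals the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)


# Stand-in for the 40,000 selected cases: case IDs only, no real data.
case_ids = range(40_000)
train_ids, valid_ids = split_half(case_ids, seed=2004)  # seed is arbitrary

# Invented labels for five validation cases; one of five is misclassified.
observed = ["lung cancer", "lung cancer", "cardiovascular disease",
            "infection", "non-lung cancer"]
predicted = ["lung cancer", "lung cancer", "lung cancer",
             "infection", "non-lung cancer"]

acc = accuracy(observed, predicted)
print(len(train_ids), len(valid_ids))  # 20000 20000
print(f"accuracy={acc:.1%}, error rate={1 - acc:.1%}")
```

With this definition the error rate is simply one minus the accuracy, which is how a reported accuracy of 69.8% corresponds to an error rate of 30.2%.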
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.