Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools-Reference-Cited by-同舟云学术

Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools

Published:2019-11-15 Issue:1 Volume:8 Page:
ISSN:2046-4053
Container-title:Systematic Reviews
language:en
Short-container-title:Syst Rev

Author:

Gates Allison^ORCID,Guitard Samantha,Pillay Jennifer,Elliott Sarah A.,Dyson Michele P.,Newton Amanda S.,Hartling Lisa

Abstract

Abstract Background We explored the performance of three machine learning tools designed to facilitate title and abstract screening in systematic reviews (SRs) when used to (a) eliminate irrelevant records (automated simulation) and (b) complement the work of a single reviewer (semi-automated simulation). We evaluated user experiences for each tool. Methods We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and workload and time savings compared to dual independent screening. To test user experiences, eight research staff tried each tool and completed a survey. Results Using Abstrackr, DistillerSR, and RobotAnalyst, respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent for the automated simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent for the semi-automated simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the automated simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the semi-automated simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the automated simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the semi-automated simulation. Abstrackr identified 33–90% of records missed by a single reviewer. RobotAnalyst performed less well and DistillerSR provided no relative advantage. User experiences depended on user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s). Conclusions The workload savings afforded in the automated simulation came with increased risk of missing relevant records. Supplementing a single reviewer’s decisions with relevance predictions (semi-automated simulation) sometimes reduced the proportion missed, but performance varied by tool and SR. Designing tools based on reviewers’ self-identified preferences may improve their compatibility with present workflows. Systematic review registration Not applicable.

Funder

Agency for Healthcare Research and Quality

Publisher

Springer Science and Business Media LLC

Subject

Medicine (miscellaneous)

Link

http://link.springer.com/content/pdf/10.1186/s13643-019-1222-2.pdf

Reference32 articles.

1. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7:e012545. https://doi.org/10.1136/bmjopen-2016-012545.

2. Thomas J, McNaught J, Ananiadou S. Applications of text mining within systematic reviews. Res Synth Methods. 2011;2:1–14. https://doi.org/10.1002/jrsm.27.

3. Tsafnat G, Glasziou P, Choong MK, Dunn A, Galgani F, Coiera E. Systematic review automation technologies. Syst Rev. 2014;3:74. https://doi.org/10.1186/2046-4053-3-74.