BACKGROUND
Accurately projecting procedural case durations is complex but critical to planning perioperative staffing, operating room resources, and patient communication. Nonlinear prediction models using machine learning methods may provide opportunities for hospitals to improve upon current estimates of procedure duration.
OBJECTIVE
We hypothesized that a machine learning algorithm derived from a large multicenter dataset would predict surgical procedure duration more accurately than a baseline linear regression approach. Because the algorithm is explainable, its results may also provide additional insight into procedure duration and variability.
METHODS
A total of 1,177,893 procedures from 13 academic and private hospitals between 2016 and 2019 were used. Deep learning, gradient boosting, and ensemble machine learning models were generated using perioperative data available at three distinct time points: time of scheduling, time of arrival to the operating/procedure room (primary model), and time of surgical incision/procedure start. The primary outcome was procedure duration, defined as the time between the patient's arrival to and departure from the procedure room. Model performance was assessed by mean absolute error, proportion of predictions within 20% of actual duration, and other standard metrics. Performance was compared to a baseline method of historical means within a linear regression model. Model features driving predictions were assessed using Shapley values and permutation feature importance.
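The two named performance metrics can be illustrated with a minimal sketch. This is not the study's code: the function names and the example durations below are hypothetical, and "within 20%" is assumed to mean that the absolute error is at most 20% of the actual duration.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted durations (minutes)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def proportion_within(actual: Sequence[float], predicted: Sequence[float],
                      tolerance: float = 0.20) -> float:
    """Share of predictions whose absolute error is at most `tolerance` times
    the actual duration (assumed reading of "within 20% of actual duration")."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(a - p) <= tolerance * a for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up durations in minutes, for illustration only.
actual_minutes = [100, 50, 200, 80]
predicted_minutes = [110, 70, 190, 80]

mae = mean_absolute_error(actual_minutes, predicted_minutes)
share = proportion_within(actual_minutes, predicted_minutes)
print(f"MAE: {mae:.1f} min, within 20%: {share:.0%}")
```

In this toy example the second case (actual 50 minutes, predicted 70) misses the 20% band, which shows why the proportion metric penalizes errors on short procedures more heavily than mean absolute error does.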
RESULTS
Across all procedures, median procedure duration was 94 minutes (interquartile range, 50-167 minutes). In estimating procedure duration, the gradient boosting machine was the best-performing model, demonstrating a mean absolute error of 34 minutes with 46% of predictions within 20% of actual duration in the test dataset. This represented a statistically and clinically significant improvement in predictions compared to a baseline linear regression model (mean absolute error of 43 minutes, p < 0.001; 39% of predictions within 20% of actual duration). The most important features in model training were historical procedure duration by surgeon, the word “free” within the procedure text, and time of day.
CONCLUSIONS
Nonlinear models using machine learning techniques may be used to generate high-performing, automatable, explainable, and scalable prediction models for procedure duration. Medi