Abstract
Missing data and class imbalance hinder accurate prediction of rare events such as mastitis (udder inflammation). Various methods are available to handle these problems; however, little is known about their individual and combined effects on the performance of machine learning (ML) models fitted to automated milking system (AMS) data for mastitis prediction. We applied imputation and resampling to improve the performance metrics of five classifiers (logistic regression, stochastic gradient descent, multilayer perceptron, decision tree and random forest). Three imputation methods, simple imputer (SI), multiple imputation by chained equations (MICE) and linear interpolation (LI), were compared with complete case analysis. Three resampling procedures were also compared: synthetic minority oversampling technique (SMOTE), Support Vector Machine SMOTE (SVMSMOTE) and SMOTE with Edited Nearest Neighbours. We evaluated the techniques by calculating precision, recall and F1 score, and compared models based on the kappa score. Both imputation and resampling techniques improved model performance. Complete case analysis suited the stochastic gradient descent (SGD) classifier better than resampling or imputation (kappa = 0.280). Logistic regression (LR) performed better with SVMSMOTE and no imputation (kappa = 0.218). Random forest (RF), decision tree (DT) and multilayer perceptron (MLP) performed better than SGD and LR, and handled class imbalance and missing values well without preprocessing. We conclude that careful selection of the technique used to handle class imbalance and missing values, before the data are passed to an ML model, is crucial for attaining the best model performance.
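As an illustration of the evaluation metrics named above, the following minimal sketch computes precision, recall, F1 score and kappa for binary labels. It assumes that the kappa score is Cohen's kappa and that mastitis is coded as 1. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and Cohen's kappa for 0/1 labels.

    Assumption: 1 marks the rare positive class (e.g. mastitis).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Observed agreement (accuracy) and agreement expected by chance.
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical toy data: 2 positive cases out of 10 observations.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # one hit, one miss, one false alarm
y_all_negative = [0] * 10                  # never predicts the rare class

print(binary_metrics(y_true, y_pred))
print(binary_metrics(y_true, y_all_negative))
```

On this toy data both predictions reach 80% accuracy, yet kappa is about 0.375 for the first and 0 for the all-negative prediction, which is why a chance-corrected score is a reasonable basis for comparing models on imbalanced data.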