Affiliation:
1. Department of Computer Science, University of Nebraska at Omaha, Omaha, NE 68182, USA
Abstract
This paper presents a quantitative analysis of the nonlinearities of the positive predictive value (PPV) and their effect on the evaluation of two-class pattern classification models on imbalanced datasets. The analysis expresses PPV as a function of the true positive rate (TPR) and the false positive rate (FPR), two classification ratios that are invariant to data imbalance, and of the imbalance ratio (IR) of the dataset, such that PPV = IR·TPR/(IR·TPR + FPR). The curvatures of PPV in the three-dimensional TPR–FPR–IR space are studied using the Hessian matrix, which reveals a saddle-shaped 3D surface in that space. The paper explores the nonlinear behavior of PPV around the critical points, identified at FPR [Formula: see text]TPR on the saddle surface, along with the scaling and sensitivity issues that arise when PPV is used as a performance measure in model evaluation. The effect of these nonlinearities on the F1 and MCC metrics on imbalanced datasets is also studied. The results warn that evaluations of classification models can be misleading without an awareness and understanding of the nonlinearities associated with PPV and its related metrics on imbalanced datasets.
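As an illustrative sketch (not taken from the paper), the dependence of PPV on TPR, FPR, and the imbalance ratio can be checked numerically. The function names and the convention that the imbalance ratio is the number of positives divided by the number of negatives are assumptions made here. Under that convention the expression follows from PPV = TP/(TP + FP) with TP = TPR·P and FP = FPR·N.

```python
def ppv(tpr: float, fpr: float, ir: float) -> float:
    """PPV from TPR, FPR and the imbalance ratio.

    Assumption: ir = P / N (positives over negatives), so that
    PPV = ir * TPR / (ir * TPR + FPR).
    """
    denom = ir * tpr + fpr
    if denom == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return ir * tpr / denom


def ppv_from_counts(tp: int, fp: int) -> float:
    """Reference definition from confusion-matrix counts: TP / (TP + FP)."""
    return tp / (tp + fp)


# The same classifier (TPR = 0.9, FPR = 0.1) evaluated at different imbalance ratios.
for ir in (1.0, 0.1, 0.01):
    print(f"IR = {ir:<5} PPV = {ppv(0.9, 0.1, ir):.4f}")
```

With TPR and FPR held fixed, PPV falls sharply as positives become rarer, which is the kind of imbalance sensitivity the abstract describes.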
Publisher
World Scientific Pub Co Pte Ltd