Abstract
Mass spectrometry proteomics is a powerful tool in biomedical research, but its usefulness is limited by the frequent occurrence of missing values for peptides that cannot be reliably quantified in particular samples. Many analysis strategies have been proposed for missing values, and the discussion often focuses on distinguishing whether values are missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). We argue here that missing values should always be viewed as MNAR in label-free proteomics, because physical missing value mechanisms cannot be identified for individual points and because the probability of detection is related to the underlying intensity. We show that the probability of detection can be accurately modeled by a logit-linear curve. The curve asymptotes close to 100%, limiting the potential role of missing values unrelated to intensity. The curve is also incompatible with simple censoring mechanisms. We propose a statistical method for estimating the detection probability curve as a function of the underlying intensity, whether observed or not. The model quantifies the bias of missing intensities relative to those that are observed. The model demonstrates that missing values are informative and suggests possible approaches to imputation and differential expression.
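To make the idea of a logit-linear detection probability curve concrete, the following is a minimal simulation sketch, not the estimation method proposed in the paper. It assumes a curve of the form logit(p) = beta0 + beta1 * intensity. The function names, the parameter values (`beta0`, `beta1`, `mu`, `sigma`) and the normal distribution of log-intensities are all hypothetical choices made for illustration. Under these assumptions, values that go undetected have a lower average intensity than values that are observed, which is the sense in which missingness is informative.

```python
import math
import random


def detection_probability(intensity, beta0, beta1):
    """Logit-linear curve: logit(p) = beta0 + beta1 * intensity."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * intensity)))


def simulate(n=20000, mu=20.0, sigma=2.0, beta0=-18.0, beta1=1.0, seed=0):
    """Draw hypothetical log-intensities and split them by simulated detection.

    All parameter values are illustrative assumptions, not estimates
    taken from any data set.
    """
    rng = random.Random(seed)
    observed, missing = [], []
    for _ in range(n):
        y = rng.gauss(mu, sigma)  # underlying log-intensity
        # Detection is a Bernoulli draw whose probability rises with intensity.
        if rng.random() < detection_probability(y, beta0, beta1):
            observed.append(y)
        else:
            missing.append(y)
    return observed, missing


observed, missing = simulate()
mean_observed = sum(observed) / len(observed)
mean_missing = sum(missing) / len(missing)

# The gap between the two means is the bias of missing values
# relative to observed ones in this toy setting.
print(f"fraction missing: {len(missing) / (len(observed) + len(missing)):.3f}")
print(f"mean observed intensity: {mean_observed:.2f}")
print(f"mean missing intensity:  {mean_missing:.2f}")
```

In this sketch the curve approaches 1 at high intensities, so nearly all missing values come from the low-intensity range. A hard censoring threshold would instead correspond to a step function rather than a smooth curve.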
Publisher
Cold Spring Harbor Laboratory