Abstract
High-content imaging (HCI) is a popular technique that leverages high-throughput datasets to uncover phenotypes of cell populations in vitro. When the differences between populations (such as a healthy and a disease state) are completely unknown, it is crucial to build very large HCI screens to account for individual (donor) variation and to include enough replicates to create a reliable model. One approach to highlight phenotypic differences is to reduce images to a set of features using unbiased methods, such as embeddings or autoencoders. These methods are effective at preserving the predictive information contained in each image while removing most of the unimportant image features and noise (e.g., background). However, they do not provide interpretable information about the features driving the decision process of the AI algorithm. While tools such as CellProfiler have been developed to address this issue, scaling them to large sample batches containing hundreds of thousands of images poses computational challenges. Additionally, the resulting feature vector, which is computationally expensive to generate, is very large (over 3,000 features) and contains many redundant features, making it challenging to perform further analysis and to identify the truly relevant features. There is also an increased risk of overfitting, as too many non-meaningful features can skew downstream predictions.

To address these issues, we developed ScaleFEx℠, a Python pipeline that extracts multiple generic fixed features at the single-cell level and can be deployed across large high-content imaging datasets with low computational requirements. The pipeline efficiently and reliably computes features related to shape, size, intensity, texture, and granularity, as well as correlations between channels. It also measures additional features specific to the mitochondria and RNA channels, as these have characteristics worth quantifying on their own. The measured features can be used not only to separate populations of cells using AI tools, but also to highlight the specific interpretable features that differ between populations. We applied ScaleFEx℠ to identify the phenotypic shifts that multiple cell lines undergo when exposed to different compounds. We used a combination of recursive feature elimination, logistic regression, correlation analysis, and dimensionality reduction to narrow down to the most meaningful features describing the drug-induced shifts. Furthermore, we used the best-scoring features to extract, for each class, images of the cells closest to the class average, visually highlighting the phenotypic shifts caused by the drugs. Using this approach, we identified features linked to the drug-induced shifts that are in line with the literature, and we could visually validate their involvement in the morphological changes of the cells.

ScaleFEx℠ can be used as a powerful tool to understand the underlying phenotypes of complex diseases and subtle drug shifts at the single-cell level, bringing us a step closer to identifying disease-modifying compounds for the major diseases of our time.
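The feature-selection workflow summarized above (correlation filtering, recursive feature elimination with a logistic-regression estimator, and dimensionality reduction) could be sketched as follows. This is a minimal illustration assuming a scikit-learn/pandas environment, not the authors' implementation; the file name, column names, feature count, and correlation threshold are hypothetical.

```python
# Hedged sketch: correlation filter + RFE (logistic regression) + PCA on a
# hypothetical per-cell feature table with a categorical "treatment" label.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("features.csv")          # hypothetical single-cell feature table
y = df["treatment"]                        # hypothetical class label per cell
X = df.drop(columns=["treatment"])

# 1) Correlation analysis: drop one of each pair of highly redundant features.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=redundant)

# 2) Recursive feature elimination with a logistic-regression estimator.
X_scaled = StandardScaler().fit_transform(X)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20, step=0.1)
rfe.fit(X_scaled, y)
top_features = X.columns[rfe.support_]

# 3) Dimensionality reduction on the retained features for visual inspection.
embedding = PCA(n_components=2).fit_transform(X_scaled[:, rfe.support_])
print(top_features.tolist())
```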
Publisher
Cold Spring Harbor Laboratory