Abstract
AbstractIdentifying how a given chemical of interest exerts its impact on biological systems is a critical step in developing new medicines and chemical products. The mechanism of a query compound of interest can sometimes be identified when its image-based morphological profile matches a compound in a library of well-annotated compound profiles.In this study, we demonstrate a significant improvement in classification performance by incorporating side information: gene representations. We generate these representations using the morphological profiles of cells where the level of a single gene’s expression has been artificially increased or decreased. The genes are selected as those encoding known protein targets of annotated compounds in the library. A transformer model is trained to classify gene-compound pairs, where each pair represents a potential interaction between a gene and a compound, as true or false. Subsequently, the model generates a ranked list of likely target genes for a previously unseen query compound. Although the strategy exhibits high performance only for compounds that target previously encountered genes – likely due to the limited size of our training dataset – the performance increase demonstrates a notable improvement over simply matching compound profiles directly to compound profiles or to gene profiles. Larger datasets may improve the prediction capabilities of this approach, enabling the prediction of gene targets for novel compounds, which can then be experimentally validated.
Publisher
Cold Spring Harbor Laboratory