Author:
Huckleberry Sean C., Silva Mary S., Drocco Jeffrey A.
Abstract
Current methods of addressing novel viruses remain predominantly reactive and reliant on empirical strategies. To develop more proactive methodologies for the early identification and treatment of diseases caused by viruses like HIV and SARS-CoV-2, we focus on host targeting, which requires identifying and altering human genetic host factors that are crucial to the life cycle of these viruses. To this end, we present three classification models to pinpoint host genes of interest. For each one, we thoroughly analyze the current predictive accuracy, susceptibility to modifications of the input space, and potential for further optimization. Our methods rely on the exploration of different gene representations, including graph-based embeddings and large foundation transformer models, to establish a set of baseline classification models. Subsequently, we introduce an order-invariant Siamese neural network that exhibits more robust pattern recognition with sparse datasets while ensuring that the representation does not capture unwanted patterns, such as the directional relationship of genetic interactions. Through these models, we generate biological features that predict pairwise gene interactions, with the intention of extrapolating this proactive therapeutic approach to other virus families.
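The order-invariance property mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' architecture, so everything below is an assumption made for illustration: the embedding sizes, the single-layer shared encoder, the randomly initialized weights, and the choice of symmetric pair features (absolute difference and elementwise product). The only point it demonstrates is that a shared encoder followed by a symmetric combination scores the pair (A, B) exactly as it scores (B, A), so the model cannot learn the direction of an interaction.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8   # hypothetical size of an input gene embedding
HIDDEN_DIM = 4  # hypothetical size of the shared encoder output

# Shared ("Siamese") encoder: the same weights encode both genes of a pair.
# Weights are random here; a real model would learn them from data.
W_enc = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))
b_enc = np.zeros(HIDDEN_DIM)

# Linear classifier head applied to the symmetric pair features.
w_out = rng.normal(size=2 * HIDDEN_DIM)


def encode(x):
    """Map one gene embedding to a hidden representation."""
    return np.tanh(x @ W_enc + b_enc)


def pair_score(gene_a, gene_b):
    """Return an interaction score in (0, 1) that ignores argument order."""
    h_a, h_b = encode(gene_a), encode(gene_b)
    # Both feature groups are unchanged when h_a and h_b are swapped.
    features = np.concatenate([np.abs(h_a - h_b), h_a * h_b])
    logit = features @ w_out
    return 1.0 / (1.0 + np.exp(-logit))


# Two hypothetical gene embeddings (random stand-ins, not real data).
gene_a = rng.normal(size=EMBED_DIM)
gene_b = rng.normal(size=EMBED_DIM)

forward = pair_score(gene_a, gene_b)
backward = pair_score(gene_b, gene_a)
print(np.isclose(forward, backward))
```

Concatenating the two hidden vectors directly would instead make the score depend on argument order, which is the kind of directional pattern the abstract says the representation should not capture.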
Publisher
Cold Spring Harbor Laboratory