On the choice of negative examples for prediction of host-pathogen protein interactions

Published:2022-12-15 Volume:2
ISSN:2673-7647
Container-title:Frontiers in Bioinformatics
Short-container-title:Front. Bioinform.

Author:

Neumann Don, Roy Soumyadip, Minhas Fayyaz Ul Amir Afsar, Ben-Hur Asa

Abstract

As practitioners of machine learning in bioinformatics, we know that the quality of our results depends crucially on the quality of our labeled data. While there is a tendency to focus on the quality of positive examples, the negative examples are equally important. In this opinion paper we revisit the problem of choosing negative examples for the task of predicting protein-protein interactions, either among proteins of a given species or between host and pathogen, and describe important issues that are prevalent in the current literature. The challenge in creating datasets for this task is the noisy nature of experimentally derived interactions and the lack of information on non-interacting proteins. A standard approach is to choose random pairs of proteins that are not known to interact as negative examples. Since the interactomes of all species are only partially known, some of these pairs do in fact interact, but such false negatives make up only a very small percentage of the negative set. This is especially true for host-pathogen interactions. To address this perceived issue, some researchers select as negative examples only those pairs of proteins whose sequence similarity to the positive examples is sufficiently low. This clearly reduces the chance of false negatives, but it also makes the problem much easier than it really is, leading to over-optimistic accuracy estimates. We demonstrate the effect of this form of bias using a selection of recent protein interaction prediction methods of varying complexity, and urge researchers to check the details of how their datasets are generated for potential biases like this.
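
The random-pair approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `sample_random_negatives`, its parameters, and the toy protein identifiers are all hypothetical, and it assumes that positives are given as (host, pathogen) identifier pairs. Pairs are drawn uniformly at random from the host and pathogen protein lists, and only pairs already known to interact are rejected, so no sequence-similarity filter is applied.

```python
import random


def sample_random_negatives(host_proteins, pathogen_proteins, positives, n, seed=0):
    """Draw n distinct (host, pathogen) pairs uniformly at random,
    rejecting only pairs that appear in the known positive set.

    All names here are illustrative assumptions, not part of any published tool.
    """
    rng = random.Random(seed)
    hosts = sorted(set(host_proteins))          # deduplicate identifiers
    pathogens = sorted(set(pathogen_proteins))
    positive_set = set(positives)

    # Count known positives that fall inside the candidate space, so we can
    # refuse requests that cannot be satisfied instead of looping forever.
    host_set, pathogen_set = set(hosts), set(pathogens)
    blocked = sum(1 for h, p in positive_set if h in host_set and p in pathogen_set)
    available = len(hosts) * len(pathogens) - blocked
    if n > available:
        raise ValueError(f"requested {n} negatives but only {available} candidate pairs exist")

    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(hosts), rng.choice(pathogens))
        if pair not in positive_set:            # the only filter: not a known positive
            negatives.add(pair)
    return sorted(negatives)


if __name__ == "__main__":
    # Toy identifiers, invented purely for illustration.
    hosts = ["H1", "H2", "H3"]
    pathogens = ["P1", "P2"]
    known_positives = [("H1", "P1")]
    print(sample_random_negatives(hosts, pathogens, known_positives, n=3))
```

Rejection sampling is adequate here because known interactions occupy a tiny fraction of all possible pairs, so almost every draw is accepted. The alternative discussed in the abstract would add a further filter that discards candidates too similar in sequence to the positives, which is exactly the step the authors argue makes the prediction task artificially easy.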

Publisher

Frontiers Media SA

Subject

General Medicine
