Abstract
BiotXplorer is an exploration tool for navigating biotic interactions in BiodiversityPMC (Pasche et al. 2023, Gobeill et al. 2020), a digitally native research library of articles for biomedical, biodiversity and environmental sciences stored in Journal Article Tag Suite (JATS)/BioC formats (Comeau et al. 2019). BiotXplorer pre-processes all documents and supplementary data using the Swiss Institute of Bioinformatics (SIB) Literature Services (SIBiLS) to build triplets, each consisting of a pair of species co-occurring in the same sentence together with a biotic interaction concept as defined in the Relation Ontology. A search service built on top of this database aggregates all triplets matching the query and uses taxonomic hierarchies to expand the search. Researchers can thus discover new biotic interactions and understand how they are supported by published evidence. We manually evaluated the precision of BiotXplorer with two benchmarks: 100 randomly selected biotic interactions from BiotXplorer and GloBI (Global Biotic Interactions), a database of biotic interactions based on tabular datasets. On the 100 random triplets generated by BiotXplorer, we achieved a precision of 31% in identifying the interacting species. For 74% of the correctly identified species pairs, we also identified the type of interaction between the two species correctly. The main cause of error was passages listing multiple species, which can be filtered out automatically. For the second benchmark, we focused on a set of validated biotic interactions rather than potential ones; 85% of the returned passages confirmed an interaction between the two species. Our primary goal is to support the detection of biotic interactions across all species. While precision depends on many factors, the vast amount of data the system processes can reveal new insights and patterns.
The inclusion of evidence for each triplet can support a wide range of One Health/Biosecurity (Hulme 2020) applications (e.g., eDNA characterization, virus spillover prediction). Furthermore, we are working on refining the system with different post-processing methods, such as reducing the volume of triplets by retaining only the top-ranked, and therefore most reliable, ones.
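
The sentence-level co-occurrence approach described above can be illustrated with a minimal Python sketch. It is not the BiotXplorer implementation: the input format, the function names (`extract_triplets`, `top_ranked`), the `max_species` threshold and the frequency-based ranking are all assumptions made for illustration. The sketch also ignores the direction of the interaction and the taxonomic expansion of queries.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each sentence carries the species names and the
# interaction concepts already recognized in it (assumed format).
SENTENCES = [
    {"species": ["Apis mellifera", "Varroa destructor"],
     "interactions": ["parasite of"]},
    {"species": ["Apis mellifera", "Varroa destructor"],
     "interactions": ["parasite of"]},
    {"species": ["Apis mellifera", "Trifolium repens"],
     "interactions": ["pollinates"]},
    # A sentence that merely lists many species; skipped by the filter below.
    {"species": ["Quercus robur", "Fagus sylvatica",
                 "Pinus sylvestris", "Betula pendula"],
     "interactions": ["eats"]},
]


def extract_triplets(sentences, max_species=3):
    """Count (species, interaction, species) triplets per sentence.

    Sentences with more than `max_species` distinct species are skipped,
    an assumed stand-in for filtering passages that list multiple species.
    """
    counts = Counter()
    for sentence in sentences:
        species = sorted(set(sentence["species"]))
        if len(species) < 2 or len(species) > max_species:
            continue
        for first, second in combinations(species, 2):
            for relation in set(sentence["interactions"]):
                counts[(first, relation, second)] += 1
    return counts


def top_ranked(counts, k):
    """Keep the k most frequently supported triplets (assumed ranking)."""
    return counts.most_common(k)


counts = extract_triplets(SENTENCES)
best = top_ranked(counts, 1)
```

In this toy data, the triplet supported by two sentences ranks first, and the sentence listing four species contributes nothing. A real system would need a ranking that reflects evidence quality rather than raw frequency alone.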