Blazing Signature Filter: a library for fast pairwise similarity comparisons

Published: 2017-07-12

Author:

Lee Joon-Yong, Fujimoto Grant M., Wilson Ryan, Wiley H. Steven, Payne Samuel H.

Abstract

Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts that would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity, or specificity of such computations will have broad impact because they allow scientists to explore the wealth of scientific data more completely. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that the two datasets do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm that enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to use computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter out unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate its ability to scale to billions of pairwise comparisons and the usefulness of this approach.
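
The filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the BSF library's API: the function names (`binarize`, `shared_count`, `candidate_pairs`), the threshold-based binarization, and the `min_shared` cutoff are all assumptions chosen for illustration. Each numeric signature is packed into an integer bit set, and a bitwise AND followed by a bit count gives a coarse similarity score that can be used to discard unproductive pairs before any expensive comparison.

```python
from itertools import combinations


def binarize(values, threshold):
    """Pack a numeric signature into an int: bit i is set when values[i] >= threshold.

    The thresholding rule is an assumption made for this sketch.
    """
    bits = 0
    for i, v in enumerate(values):
        if v >= threshold:
            bits |= 1 << i
    return bits


def shared_count(a, b):
    """Count positions set in both packed signatures (bitwise AND, then popcount)."""
    return bin(a & b).count("1")


def candidate_pairs(signatures, threshold, min_shared):
    """Return index pairs whose binarized signatures share at least min_shared set bits."""
    packed = [binarize(s, threshold) for s in signatures]
    return [
        (i, j)
        for i, j in combinations(range(len(packed)), 2)
        if shared_count(packed[i], packed[j]) >= min_shared
    ]


# Hypothetical toy data: three signatures over five features.
signatures = [
    [2.1, 0.0, 1.7, 0.2, 3.0],
    [1.9, 0.1, 2.2, 0.0, 0.3],
    [0.0, 2.5, 0.1, 1.8, 0.0],
]
pairs = candidate_pairs(signatures, threshold=1.5, min_shared=2)
print(pairs)
```

Only the pairs that survive this coarse filter would then be passed to a slower, more precise similarity measure. A production implementation would likely pack bits into fixed-width machine words and use hardware popcount instructions rather than Python integers.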

Publisher

Cold Spring Harbor Laboratory

Cited by 1 article.

1. A Easy to Use Generalized Template to Support Development of GPU Algorithms; Computational Biology; 2022