An Efficient and Principled Model to Jointly Learn the Agnostic and Multifactorial Effect in Large-Scale Biological Data

Author:

Cheng Zuolin, Wei Songtao, Wang Yinxue, Wang Yizhi, Lu Q Richard, Wang Yue, Yu Guoqiang

Abstract

The rich information contained in biological data is often distorted by multiple interacting intrinsic or extrinsic factors. Modeling the effects of these factors is necessary to uncover the true underlying signals. This is challenging in real applications, because biological data usually involve tens of thousands or millions of factors, and no reliable prior knowledge is available on how these factors exert their effects, how strong the effects are, or how the factors interact with each other. Existing approaches therefore rely on excessive simplification or unrealistic assumptions, such as probabilistic independence among factors. In this paper, we report the finding that, after reformulating the data as a contingency tensor, the problem can be well addressed by a fundamental machine learning principle, Maximum Entropy, with the extra effort of designing an efficient algorithm to solve the resulting large-scale optimization problem. Based on the principle of maximum entropy, and by constraining the marginals of the contingency tensor using the observed values, our Conditional Multifactorial Contingency (CMC) model imposes minimal but essential assumptions about the multifactorial joint effects and leads to a conceptually simple distribution, which indicates how these factors exert their effects and interact with each other. By replacing hard constraints with constraints on expected values, CMC avoids an NP-hard problem and yields a theoretically solvable convex problem. However, because of the large number of variables and constraints, standard convex solvers do not work. Exploiting the special properties of the CMC model, we developed an efficient iterative optimizer, which reduces the running time from infeasible to minutes or from days to seconds.
We applied CMC to several cutting-edge biological applications, including the detection of driving transcription factors, scRNA-seq normalization, cancer-associated gene identification, GO-term activity transformation, and quantification of single-cell-level similarity. CMC performed much better than other methods with respect to various evaluation criteria. Our source code for CMC, as well as its example applications, can be found at https://github.com/yu-lab-vt/CMC.

One-Sentence Summary

CMC jointly learns the intertwined effects of numerous factors in biological data and outperforms existing methods in multiple cutting-edge biological applications.
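To make the maximum-entropy idea in the abstract concrete, the following is a minimal two-factor sketch and not the authors' CMC implementation. It assumes a binary contingency matrix whose row and column sums are matched only in expectation. Under that assumption the maximum-entropy distribution has independent Bernoulli entries with probability sigmoid(a_i + b_j), and the parameters can be found by minimizing a convex dual. The function name `fit_maxent_margins`, the toy matrix, and the use of plain gradient descent (swapped in here for the paper's specialized iterative optimizer) are all illustrative assumptions.

```python
import numpy as np

def fit_maxent_margins(row_targets, col_targets, tol=1e-10, max_iter=50000):
    """Hypothetical helper: maximum-entropy Bernoulli matrix whose
    expected row and column sums equal the given targets."""
    r = np.asarray(row_targets, dtype=float)
    c = np.asarray(col_targets, dtype=float)
    a = np.zeros(r.size)  # one parameter per row factor
    b = np.zeros(c.size)  # one parameter per column factor
    # Conservative fixed step: max(m, n) / 2 bounds the curvature of the dual.
    step = 2.0 / max(r.size, c.size)
    for _ in range(max_iter):
        # Entry probabilities p_ij = sigmoid(a_i + b_j).
        p = 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))
        # Dual gradient = expected marginals minus observed marginals.
        grad_a = p.sum(axis=1) - r
        grad_b = p.sum(axis=0) - c
        if max(np.abs(grad_a).max(), np.abs(grad_b).max()) < tol:
            break
        a -= step * grad_a
        b -= step * grad_b
    return p

# Toy observed matrix (assumed data, no all-zero or all-one row or column).
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
P = fit_maxent_margins(X.sum(axis=1), X.sum(axis=0))
```

Each entry of `P` can be read as the probability of observing a 1 given only the two marginal effects. A full-scale problem with many factors would need a far more efficient solver than this fixed-step loop, which is the gap the abstract says the CMC optimizer addresses.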

Publisher

Cold Spring Harbor Laboratory

Cited by 1 article.