Author:
Tian Xuedong, Wang Jiameng, Wen Yu, Ma Hongyan
Abstract
<abstract>
<p>Scientific documents contain many mathematical expressions as well as text that carries mathematical semantics. Retrieval based on mathematical expressions alone or on text alone rarely meets retrieval needs; the real difficulty lies in effectively integrating mathematical expressions with their related textual features. This study therefore proposes a multi-attribute retrieval and ranking model for scientific documents, based on GBDT (gradient boosting decision tree) and LR (logistic regression), that integrates the expressions and text contained in scientific documents. First, similarities are calculated for five attributes: mathematical expression symbols, mathematical expression sub-forms, mathematical expression context, scientific document keywords and the frequency of mathematical expressions. Next, the GBDT model is used to discretize and reorganize the five attributes. Finally, the reorganized features are input into the LR model to obtain the final retrieval and ranking results for scientific documents. The experiments were carried out on the NTCIR dataset. The average MAP@20 for scientific document recall was 81.92%, and the average nDCG@20 for scientific document ranking was 86.05%.</p>
</abstract>
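The GBDT-then-LR pipeline described in the abstract can be illustrated with a minimal sketch. It follows the commonly used pattern in which each sample is mapped to the leaf it reaches in every boosted tree, the leaf indices are one-hot encoded, and a logistic regression is trained on the encoded features. This is an assumption about how "discretize and reorganize" might be realized, not the authors' implementation. The function names `fit_gbdt_lr` and `score_documents`, the hyperparameters, and the synthetic five-column similarity matrix are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder


def fit_gbdt_lr(X_gbdt, y_gbdt, X_lr, y_lr, n_trees=20, max_depth=3, seed=0):
    """Fit a GBDT on one split, then an LR on one-hot leaf indices of another.

    Hypothetical helper: the hyperparameters are illustrative defaults.
    """
    gbdt = GradientBoostingClassifier(
        n_estimators=n_trees, max_depth=max_depth, random_state=seed
    )
    gbdt.fit(X_gbdt, y_gbdt)

    # apply() returns leaf indices with shape (n_samples, n_trees, 1)
    # for binary labels; drop the last axis to get one column per tree.
    leaves = gbdt.apply(X_lr)[:, :, 0]

    # Each (tree, leaf) pair becomes one binary feature.
    encoder = OneHotEncoder(handle_unknown="ignore")
    lr = LogisticRegression(max_iter=1000)
    lr.fit(encoder.fit_transform(leaves), y_lr)
    return gbdt, encoder, lr


def score_documents(model, X):
    """Return the LR relevance probability for each candidate document."""
    gbdt, encoder, lr = model
    leaves = gbdt.apply(X)[:, :, 0]
    return lr.predict_proba(encoder.transform(leaves))[:, 1]


# Synthetic stand-in for the five attribute similarities (assumed to lie
# in [0, 1]): symbols, sub-forms, context, keywords, expression frequency.
rng = np.random.default_rng(0)
X = rng.random((400, 5))
y = (X.mean(axis=1) > 0.5).astype(int)  # hypothetical relevance labels

# Separate splits for the two stages, so the LR does not train on leaf
# assignments that the GBDT has already fitted to.
model = fit_gbdt_lr(X[:200], y[:200], X[200:], y[200:])

candidates = rng.random((10, 5))
scores = score_documents(model, candidates)
ranking = np.argsort(-scores)  # candidate indices, most relevant first
```

Sorting candidates by the LR probability yields the ranked list; with real data, the five columns would hold the similarities computed between a query and each candidate document.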
Publisher
American Institute of Mathematical Sciences (AIMS)
Subject
Applied Mathematics, Computational Mathematics, General Agricultural and Biological Sciences, Modeling and Simulation, General Medicine
References: 28 articles.
1. K. Yamada, H. Murakami, Mathematical expression retrieval in PDFs from the web using mathematical term queries, in 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, (2020), 155-161. https://doi.org/10.1007/978-3-030-55789-8_14
2. R. M. Oliveira, F. B. Gonzaga, V. Barbosa, G. Xexéo, A distributed system for search on math based on the Microsoft BizSpark program, preprint, arXiv: 1711.04189.
3. P. Sojka, M. Ruzicka, V. Novotný, MIaS: Math-aware retrieval in digital mathematical libraries, in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, (2018), 1923-1926. https://doi.org/10.1145/3269206.3269233
4. B. Mansouri, S. Rohatgi, D. Oard, J. Wu, C. L. Giles, R. Zanibbi, Tangent-CFT: An embedding model for mathematical formulas, in Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, (2019), 11-18. https://doi.org/10.1145/3341981.3344235
5. J. M. Xu, C. Y. Xu, Computing similarity of scientific documents based on texts and formulas, Data and Knowledge Discovery, 2 (2018), 103-109. Available from: https://wenku.baidu.com/view/3ca592af1cd9ad51f01dc281e53a580217fc500d.html?fr=income1-wk_app_search_ctr-search
Cited by
4 articles.