Abstract
Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses at the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Usual approaches rely on publications’ metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities from the resulting weighted graphs through label propagation, a fast yet efficient algorithm that requires no tuning. The resulting procedure reaches reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually-labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach utilizes the Random Forest classifier and yields micro- and macro-averaged BCubed $\mathrm{F}_1$ scores of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
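To make the evaluation metric concrete, the following is a minimal sketch of BCubed precision, recall, and $\mathrm{F}_1$ for a single author block, following the general definition in Amigó et al. (2009). It is not the released code of the paper: the function name `bcubed`, the dictionary-based input format, and the toy authorship identifiers are assumptions made for illustration. How per-block scores are aggregated into the micro- and macro-averaged figures is not specified in the abstract and is left out here.

```python
from collections import Counter


def bcubed(pred, gold):
    """Return BCubed (precision, recall, f1).

    pred and gold map each authorship id to a cluster label
    (hypothetical input format, chosen for this sketch).
    """
    items = list(gold)
    pred_sizes = Counter(pred[i] for i in items)   # size of each predicted cluster
    gold_sizes = Counter(gold[i] for i in items)   # size of each true author profile
    # Number of items sharing both the predicted and the true cluster.
    overlap = Counter((pred[i], gold[i]) for i in items)

    # Per-item precision: share of the item's predicted cluster that
    # truly belongs to the same author; recall is the mirror image.
    precision = sum(
        overlap[(pred[i], gold[i])] / pred_sizes[pred[i]] for i in items
    ) / len(items)
    recall = sum(
        overlap[(pred[i], gold[i])] / gold_sizes[gold[i]] for i in items
    ) / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy block: four authorships, two true authors "A" and "B".
gold = {"a1": "A", "a2": "A", "a3": "A", "a4": "B"}
# A clustering that wrongly merges a3 (author A) with a4 (author B).
pred = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}

precision, recall, f1 = bcubed(pred, gold)
print(precision, recall, f1)
```

In this toy block the merge of `a3` and `a4` lowers precision, while splitting author A across two clusters lowers recall, which is the behaviour that makes BCubed suitable for judging author clusters.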
Funder
International Science Council
Hochschule für Technik und Wirtschaft Berlin
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences, Computer Science Applications, General Social Sciences
Cited by
10 articles.