Abstract
Digital libraries play a key role in cultural heritage: they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances in NLP models, most of them are built for specific languages and contemporary documents, and are not optimized for historical material, which may, for instance, contain language variations and optical character recognition (OCR) errors. In this work, we focus on the entity linking (EL) task, which is fundamental to indexing documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles that combines multilingual analysis, OCR correction, and filter analysis to mitigate the difficulties that historical documents pose for the EL task. The source code is publicly available. We ran experiments on two historical document corpora covering five European languages (English, Finnish, French, German, and Swedish). The results show that our system improved overall performance for all languages and datasets, achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.
Funder
H2020 Societal Challenges
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences
Cited by
4 articles.