Abstract
The paper presents CHerIDesCo (Cultural Heritage - Italian Description Corpus), a domain-specific linguistic resource designed for training and testing novel NLP tools in the Cultural Heritage field. The corpus has been developed by the UNIOR NLP Research Group as part of the SMACH project, a three-year project funded by the National Operative Program to pursue the Smart Specialization Strategy defined by the EU. The project aims to improve language-based human-computer interaction in the Cultural Heritage domain by developing innovative applications for multilingual access to content, based on semantic language technologies. In particular, the paper describes the design of the CHerIDesCo corpus, the annotation procedures, and the platforms where the resource has been uploaded. As pointed out in the conclusion, this linguistic resource can be exploited in several NLP tasks, such as Named-Entity Recognition (NER), Named-Entity Linking (NEL), and Topic Modeling.
Publisher
Servicio de Publicaciones de la Universidad Autónoma de Madrid