Author:
Oscar Lithgow-Serrano, Socorro Gama-Castro, Cecilia Ishida-Gutiérrez, Citlali Mejía-Almonte, Víctor Tierrafría, Sara Martínez-Luna, Alberto Santos-Zavaleta, David Velázquez-Ramírez, Julio Collado-Vides
Abstract
The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would benefit strongly from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource. Given our interest in applying such approaches to the curation of the biomedical literature, specifically literature about gene regulation in microbial organisms, we decided to build a corpus with graded textual similarity evaluated by curators and designed specifically for our purposes. Based on the predefined statistical power of future analyses, we defined the features of the design, including sampling, selection criteria, balance, and size, among others. In a non-fully crossed design, each pair of sentences was evaluated by 3 evaluators from 7 different groups, adapting the SEMEVAL scale to our goals over four successive iterative sessions, with a clear improvement in the consensus guidelines and inter-rater reliability results. Alternatives for evaluating the corpus are discussed at length. To the best of our knowledge, this is the first similarity corpus in this domain of knowledge. We have begun incorporating it into our research on high-throughput curation strategies based on natural language processing.
Publisher
Cold Spring Harbor Laboratory