Author:
Gutehrlé Nicolas,Atanassova Iana
Abstract
Background. In recent years, libraries and archives led important
digitisation campaigns that opened the access to vast collections of historical
documents. While such documents are often available as XML ALTO documents, they
lack information about their logical structure. In this paper, we address the
problem of Logical Layout Analysis applied to historical documents in French.
We propose a rule-based method, that we evaluate and compare with two
Machine-Learning models, namely RIPPER and Gradient Boosting. Our data set
contains French newspapers, periodicals and magazines, published in the first
half of the twentieth century in the Franche-Comt\'e Region. Results. Our
rule-based system outperforms the two other models in nearly all evaluations.
It has especially better Recall results, indicating that our system covers more
types of every logical label than the other two models. When comparing RIPPER
with Gradient Boosting, we can observe that Gradient Boosting has better
Precision scores but RIPPER has better Recall scores. Conclusions. The
evaluation shows that our system outperforms the two Machine Learning models,
and provides significantly higher Recall. It also confirms that our system can
be used to produce annotated data sets that are large enough to envisage
Machine Learning or Deep Learning approaches for the task of Logical Layout
Analysis. Combining rules and Machine Learning models into hybrid systems could
potentially provide even better performances. Furthermore, as the layout in
historical documents evolves rapidly, one possible solution to overcome this
problem would be to apply Rule Learning algorithms to bootstrap rule sets
adapted to different publication periods.
Publisher
Centre pour la Communication Scientifique Directe (CCSD)
Cited by
3 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献