Abstract
Automatic transcription of large series of historical handwritten documents generally aims to make the textual information in these documents searchable. However, automatic transcripts often lack the accuracy needed for reliable text indexing and search. Probabilistic Indexing (PrIx) offers a unique alternative to raw transcripts. Since it needs training data to achieve good search performance, this paper introduces PrIx-based crowdsourcing techniques to gather the required data. In the proposed approach, PrIx confidence measures drive a correction process in which users can amend errors and possibly add missing text. In a further step, the corrected data are used to retrain the PrIx models. Results on five large series are reported, showing consistent improvements after retraining. However, it remains debatable whether the improvements justify the overall cost of the crowdsourcing operation, or whether it would have been more cost-effective to simply start with a larger and cleaner set of professionally produced training transcripts.
Funder
CDTI
Universitat Politècnica de València
Publisher
Springer Science and Business Media LLC