Species Detection and Segmentation of Multi-specimen Historical Herbaria

Authors:

Thirukokaranam Chandrasekar Krishna Kumar, Milleville Kenzo, Verstockt Steven

Abstract

Historically, herbarium specimens have provided users with documented occurrences of plants in specific locations over time. Herbarium collections have therefore been the basis of systematic botany for centuries (Younis et al. 2020). According to the latest summary report based on data from Index Herbariorum, there are around 3400 active herbaria in the world, containing 397 million specimens spread across 182 countries (Thiers 2021). Rapid growth in the availability of high-quality image-capturing devices, driven by the enormous number of uncovered collections, has further led to rising interest in large-scale digitisation initiatives across the world (Le Bras et al. 2017). As herbarium specimens are increasingly digitised and made accessible in online repositories, a need has emerged for automated tools that process and enrich these collections to facilitate better access to the preserved archives. The rising number of digitised herbarium sheets provides an opportunity to employ computer-based image processing techniques, such as deep learning, to automatically identify species and higher taxa (Carranza-Rojas and Joly 2018, Carranza-Rojas et al. 2017, Younis et al. 2020) or to extract other useful information from the sheets, for example by detecting handwritten text, color bars, scales and barcodes. Species identification works well for herbarium sheets that have only one species on a page. However, many herbarium books have multiple species on the same page (as shown in Fig. 1), which makes the identification problem considerably more complex. Enriching such pages manually also takes a great deal of time and effort. In this work, we propose a pipeline that can automatically detect, identify, and enrich plant species in multi-specimen herbaria.
The core idea of the pipeline is to detect the individual plant specimens and the handwritten text around them, and to map each piece of text to the correct plant. As shown in Fig. 2, the proposed pipeline begins with pre-processing of the images. Each image is rotated and aligned so that its longest edge becomes its height. In the case of herbarium books, the pages are detected and morphological transformations are performed to reduce occlusions (Thirukokaranam Chandrasekar and Verstockt 2020). A YOLOv3 (You Only Look Once version 3) object detection model (Zhao and Li 2020) was trained from scratch to detect plants and text. The model was trained on a dataset of single-species herbarium sheets, using a mosaic augmentation technique so that the plant detector extends to pages with multiple species. The first training results are promising, although they could be further improved with more labelled data. We also plan to train an object segmentation model and compare its performance with that of the plant detection model on multi-specimen herbarium sheets. After both the plants and the text have been detected, the text will be recognized with a state-of-the-art handwritten text recognition (HTR) model. The recognized text can then be matched against a database of specimens to identify each detected specimen. Furthermore, additional textual metadata visible on the sheet (e.g. date, locality, collector's name, institution) will be recognized and used to enrich the collection.
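
The abstract does not specify how detected text is mapped to plants or how recognized text is matched against the specimen database. As a minimal illustrative sketch only, the Python example below assumes that each text box belongs to the plant box with the nearest centre and that recognized names are matched by fuzzy string similarity. The function names, the box format and the similarity cutoff are all hypothetical and are not taken from the authors' implementation.

```python
from difflib import get_close_matches
from math import hypot

# Assumption: every box is (x_min, y_min, x_max, y_max) in pixel coordinates.


def center(box):
    """Return the centre point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def assign_text_to_plants(plant_boxes, text_boxes):
    """Map each text-box index to the index of the nearest plant box.

    Nearest-centre distance is an assumed heuristic, not the authors' rule.
    """
    if not plant_boxes:
        return {}
    plant_centers = [center(b) for b in plant_boxes]
    assignment = {}
    for t_idx, t_box in enumerate(text_boxes):
        tx, ty = center(t_box)
        distances = [hypot(tx - px, ty - py) for px, py in plant_centers]
        assignment[t_idx] = distances.index(min(distances))
    return assignment


def match_species(recognized_text, known_names, cutoff=0.6):
    """Return the known name closest to the (possibly noisy) HTR output.

    Returns None when no name is at least `cutoff` similar.
    """
    by_lower = {name.lower(): name for name in known_names}
    hits = get_close_matches(recognized_text.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


# Hypothetical detections for a page with two specimens and two labels.
plants = [(50, 100, 400, 900), (600, 120, 950, 880)]
texts = [(80, 920, 300, 980), (650, 900, 900, 960)]
mapping = assign_text_to_plants(plants, texts)

# Hypothetical database entries and a misspelled HTR result.
names = ["Bellis perennis", "Urtica dioica"]
best = match_species("Bellis perenis", names)
```

With these example boxes, each label is assigned to the specimen directly above it, and the misspelled label resolves to "Bellis perennis". A real pipeline would likely need a more robust assignment rule (for instance, one that accounts for box edges rather than centres) and a tuned similarity threshold.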

Publisher

Pensoft Publishers
