A Five-Step Workflow to Manually Annotate Unstructured Data into Training Dataset for Natural Language Processing

Published: 2024-01-25
ISSN: 0926-9630
Container-title: Studies in Health Technology and Informatics

Author:

Zhu Yunshu¹, Song Ting¹, Zhang Zhenyu¹, Yin Mengyang², Yu Ping¹

Affiliation:

1. Centre for Digital Transformation, School of Computing and Information Technology, University of Wollongong, Wollongong, New South Wales, Australia

2. Opal Healthcare, Sydney, Australia

Abstract

Natural Language Processing (NLP) is a powerful technique for extracting valuable information from unstructured electronic health records (EHRs). However, NLP requires high-quality annotated datasets. To date, there is a lack of effective methods to guide the manual annotation of unstructured datasets, which can hinder NLP performance. Therefore, this study develops a five-step workflow for manually annotating unstructured datasets: (1) annotator training and familiarisation with the text corpus, (2) vocabulary identification, (3) annotation schema development, (4) annotation execution, and (5) result validation. The workflow was then applied to annotate agitation symptoms in the unstructured EHRs of 40 Australian residential aged care facilities. The annotated corpus achieved an accuracy rate of 96%. This suggests that the proposed annotation workflow can be used in manual data processing to develop annotated training corpora for NLP algorithms.
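
The abstract does not say how the accuracy rate in step (5) was computed. One plausible reading, offered only as a minimal sketch, is the share of annotations whose label agrees with an independent reviewer's label. The `Annotation` class, the `annotation_accuracy` function, the label names, and the toy records below are all hypothetical illustrations and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated text span (hypothetical structure, not from the paper)."""
    note_id: str
    span: str
    label: str


def annotation_accuracy(annotations, reviewer_labels):
    """Return the fraction of annotations whose label matches the reviewer's.

    reviewer_labels maps (note_id, span) to the label the reviewer assigned.
    A span the reviewer did not label counts as a disagreement.
    """
    if not annotations:
        raise ValueError("no annotations to validate")
    correct = sum(
        1
        for a in annotations
        if reviewer_labels.get((a.note_id, a.span)) == a.label
    )
    return correct / len(annotations)


# Toy data invented for illustration only; not drawn from any real EHR.
sample = [
    Annotation("n1", "shouting at staff", "verbal_agitation"),
    Annotation("n1", "pacing the corridor", "physical_agitation"),
    Annotation("n2", "refused medication", "verbal_agitation"),
    Annotation("n3", "hitting the table", "physical_agitation"),
]
reviewer_labels = {
    ("n1", "shouting at staff"): "verbal_agitation",
    ("n1", "pacing the corridor"): "physical_agitation",
    ("n2", "refused medication"): "resistance_to_care",  # reviewer disagrees
    ("n3", "hitting the table"): "physical_agitation",
}

accuracy = annotation_accuracy(sample, reviewer_labels)
print(f"accuracy = {accuracy:.2f}")
```

With three of the four toy labels matching, the sketch reports an accuracy of 0.75. A real validation step might also report chance-corrected agreement between annotators, which this sketch does not cover.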

Publisher

IOS Press

Link

https://ebooks.iospress.nl/pdf/doi/10.3233/SHTI230937