A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning

Author:

Ding Junhua (1), Li Xinchuan (2), Kang Xiaojun (3), Gudivada Venkat N. (4)

Affiliation:

1. University of North Texas, Denton, TX

2. China University of Geosciences (Wuhan), Wuhan, China

3. China University of Geosciences (Wuhan), China

4. East Carolina University, Greenville, NC, USA

Abstract

Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires a significant amount of training data. Experiments have shown that both the volume and the quality of training data can significantly affect the effectiveness of value extraction. In some cases, the volume of training data is not large enough to train a deep learning model effectively. In other cases, the quality of the training data is not high enough to achieve optimal performance. Many approaches have been proposed for augmenting training data to mitigate these deficiencies. However, whether the augmented data are “fit for purpose” for deep learning remains an open question, and a framework for comprehensively evaluating the effectiveness of augmented data for deep learning is not yet available. In this article, we first discuss a data augmentation approach for deep learning. The approach has two components: the first removes noisy data from a dataset using machine learning based classification to improve its quality, and the second increases the volume of the dataset so that a deep learning model can be trained effectively. To evaluate the quality of the augmented data in terms of fidelity, variety, and veracity, we propose a data quality evaluation framework. We demonstrate the effectiveness of the data augmentation approach and the data quality evaluation framework by studying the automated classification of biological cell images using deep learning. The experimental results clearly demonstrate the impact of the volume and quality of training data on the performance of deep learning, as well as the importance of data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning studies in other domains.
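
As an illustration only, the two components described in the abstract (noise removal followed by volume increase) might be sketched as below. This is not the authors' implementation: the function names, the `is_noise` callable standing in for a trained classifier, and the choice of rotations and flips as transforms are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Sample = Tuple[Image, str]       # (image, class label)


def filter_noisy(samples: Sequence[Sample],
                 is_noise: Callable[[Image], bool]) -> List[Sample]:
    """Component 1 (sketch): drop samples that a classifier flags as noise.

    `is_noise` is a hypothetical stand-in for a trained classifier.
    """
    return [(img, label) for img, label in samples if not is_noise(img)]


def rotate90(img: Image) -> Image:
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def augment(samples: Sequence[Sample]) -> List[Sample]:
    """Component 2 (sketch): grow the dataset with label-preserving transforms.

    Each image yields 8 variants: 4 rotations and their mirror images.
    Symmetric images may produce duplicate variants.
    """
    out: List[Sample] = []
    for img, label in samples:
        variants = [img]
        for _ in range(3):
            variants.append(rotate90(variants[-1]))
        variants += [flip_horizontal(v) for v in variants]
        out.extend((v, label) for v in variants)
    return out


if __name__ == "__main__":
    # Toy data: one informative image and one blank image treated as noise.
    data = [([[1, 2], [3, 4]], "cell"), ([[0, 0], [0, 0]], "debris")]
    blank = lambda img: all(p == 0 for row in img for p in row)
    clean = filter_noisy(data, blank)
    print(len(clean), len(augment(clean)))
```

In this sketch the filter runs before augmentation so that noisy samples are not multiplied by the transforms. Whether a given transform preserves the class label depends on the domain and would need to be checked for the data at hand.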

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

