Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data

Published:2021-04-02 Issue:2 Volume:2 Page:1-26
ISSN:2691-1922
Container-title:ACM/IMS Transactions on Data Science
language:en
Short-container-title:ACM/IMS Trans. Data Sci.

Author:

Liu Yu¹, Wang Yangtao¹, Gao Lianli², Guo Chan¹, Xie Yanzhao¹, Xiao Zhili³

Affiliation:

1. Huazhong University of Science and Technology, Wuhan, China

2. University of Electronic Science and Technology of China, Chengdu, China

3. Tencent Inc., Shenzhen, China

Abstract

Data mining constantly faces, yet can hardly solve, the problem that a dataset contains little meaningful information for a given requirement. When allocating data mining resources across multiple unknown datasets so as to acquire more of the desired data, it is necessary to establish a data quality assessment framework based on the relevance between each dataset and the requirements. Such a framework helps the user judge the potential benefits in advance and thereby optimize resource allocation among the candidate datasets. However, unstructured data (e.g., image data) is often in a dark-data state, which makes it difficult for the user to understand, in real time, how relevant a dataset's content is. Even if all data have label descriptions, measuring the relevance between data efficiently under semantic propagation remains an urgent problem. To address this, we propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which consists of an off-line learning and relevance mining part and an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate an importance score for each node (image), so as to obtain a relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance score by hash code and then quickly obtain the assessment conclusion from the importance list. On the one hand, introducing GCN and co-occurrence probability into GAH improves its ability to perceive dark data. On the other hand, CPR exploits hash collisions to reduce the scale of the graph and the iteration matrix, which greatly reduces space and computation costs.

We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements and to test resource allocation. Experimental results show that our framework gains the most desired data with the same mining resources. In addition, test results on the Tencent1M dataset demonstrate that the framework completes the assessment stably across different requirements.
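
The off-line and on-line flow outlined in the abstract can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not the paper's GAH or CPR algorithms: hash codes are assumed to be already available as integers (the learned GCN-AutoEncoder hashing is not reproduced), the function names, the Hamming radius, and the damping factor are hypothetical, and a plain PageRank whose weights come from collision counts stands in for Cluster PageRank.

```python
from collections import Counter


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def build_graph(codes, radius):
    """Merge images with identical codes into one node (hash collision),
    then link nodes whose codes differ by at most `radius` bits."""
    counts = Counter(codes)            # node -> number of images sharing the code
    nodes = sorted(counts)
    edges = {u: [] for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if hamming(u, v) <= radius:
                edges[u].append(v)
                edges[v].append(u)
    return counts, edges


def pagerank(counts, edges, damping=0.85, iters=100):
    """Power iteration over the collapsed graph (illustrative stand-in only).
    Teleport and transition weights are proportional to collision counts."""
    nodes = list(edges)
    total = sum(counts.values())
    teleport = {u: counts[u] / total for u in nodes}
    score = dict(teleport)
    for _ in range(iters):
        new = {u: (1 - damping) * teleport[u] for u in nodes}
        for u in nodes:
            nbrs = edges[u]
            if nbrs:
                weight = sum(counts[v] for v in nbrs)
                for v in nbrs:
                    new[v] += damping * score[u] * counts[v] / weight
            else:
                # Isolated node: spread its score via the teleport distribution.
                for v in nodes:
                    new[v] += damping * score[u] * teleport[v]
        score = new
    return score


def assess(query_code, score):
    """On-line step: look up the importance score by hash code."""
    return score.get(query_code, 0.0)


if __name__ == "__main__":
    # Four toy images; the first two collide on the same 4-bit code.
    codes = [0b0000, 0b0000, 0b0001, 0b1111]
    counts, edges = build_graph(codes, radius=1)
    score = pagerank(counts, edges)
    for code in sorted(score, key=score.get, reverse=True):
        print(f"{code:04b}", round(score[code], 4))
```

Because identical codes collapse into a single node, the graph and the iteration run over distinct codes rather than individual images, which is the kind of size reduction the abstract attributes to hash collisions. The pairwise loop in `build_graph` is quadratic in the number of distinct codes and is only meant for toy inputs.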

Funder

Innovation Group Project of the National Natural Science Foundation of China

National Key Research and Development Program of China

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3420038

Cited by 1 article.

1. Data Quality Metrics for Unlabelled Datasets;2022 IEEE 4th International Conference on BioInspired Processing (BIP);2022-11-15