Survey on categorical data for neural networks

Published:2020-04-10 Issue:1 Volume:7 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Hancock John T.,Khoshgoftaar Taghi M.

Abstract

This survey investigates current techniques for representing qualitative data for use as input to neural networks. Techniques for using qualitative data in neural networks are well known. However, researchers continue to discover new variations or entirely new methods for working with categorical data in neural networks. Our primary contribution is to cover these representation techniques in a single work. Practitioners working with big data often have a need to encode categorical values in their datasets in order to leverage machine learning algorithms. Moreover, the size of data sets we consider as big data may cause one to reject some encoding techniques as impractical, due to their running time complexity. Neural networks take vectors of real numbers as inputs. One must use a technique to map qualitative values to numerical values before using them as input to a neural network. These techniques are known as embeddings, encodings, representations, or distributed representations. Another contribution this work makes is to provide references for the source code of various techniques, where we are able to verify the authenticity of the source code. We cover recent research in several domains where researchers use categorical data in neural networks. Some of these domains are natural language processing, fraud detection, and clinical document automation. This study provides a starting point for research in determining which techniques for preparing qualitative data for use with neural networks are best. It is our intention that the reader should use these implementations as a starting point to design experiments to evaluate various techniques for working with qualitative data in neural networks. The third contribution we make in this work is a new perspective on techniques for using categorical data in neural networks. We organize techniques for using categorical data in neural networks into three categories. We find three distinct patterns in techniques that identify a technique as determined, algorithmic, or automated. The fourth contribution we make is to identify several opportunities for future research. The form of the data that one uses as an input to a neural network is crucial for using neural networks effectively. This work is a tool for researchers to find the most effective technique for working with categorical data in neural networks, in big data settings. To the best of our knowledge this is the first in-depth look at techniques for working with categorical data in neural networks.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-020-00305-w.pdf

References: 84 articles.

1. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge: MIT Press; 2016.

2. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3(Jan):993–1022.

3. Cheng G, Berkhahn F. Entity embeddings of categorical variables. CoRR. 2016. arXiv:1604.06737.

4. Lacey M. Categorical data. 2019. http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm. Accessed 23 Sept 2019.

5. Lane DM. Online statistics education: an interactive multimedia course of study. 2019. http://onlinestatbook.com/2/index.html. Accessed 15 Dec 2019.

Cited by 307 articles.

1. An Advisor Neural Network framework using LSTM-based Informative Stock Analysis;Expert Systems with Applications;2025-01

2. Challenges and opportunities of generative models on tabular data;Applied Soft Computing;2024-11

3. Spatial prediction of groundwater salinity in multiple aquifers of the Mekong Delta region using explainable machine learning models;Water Research;2024-11

4. m5C-Seq: Machine learning-enhanced profiling of RNA 5-methylcytosine modifications;Computers in Biology and Medicine;2024-11

5. Batch reinforcement learning approach using recursive feature elimination for network intrusion detection;Engineering Applications of Artificial Intelligence;2024-10