Abstract
Abstract
Missing values and incomplete observations can exist in just about ever type of recorded data.
With analytical modeling, and machine learning in particular, the quantity and quality of available data is paramount to acquiring reliable results. Within the oil industry alone, priorities in which data is important can vary from company to company, leading to available knowledge of a single field to vary from place to place. With machine learning requiring very complete sets of data, this issue can require whole portions of data to be discarded in order to create an appropriate dataset.
Value imputation has emerged as a valuable solution in cleaning up datasets, and as current technology has advanced new generative machine learning methods have been used to generate images and data that is all but indistinguishable from reality. Using an adaptation of the standard Generative Adversarial Networks (GAN) approach known as a Generative Adversarial Imputation Network (GAIN), this paper evaluates this method and other imputation methods for filling in missing values. Using a gathered fully observed set of data, smaller datasets with randomly masked missing values were generated to validate the effectiveness of the various imputation methods; allowing comparisons to be made against the original dataset. The study found that with various sizes of missing data percentages withing the sets, the "filled in" data could be used with surprising accuracy for further analytics.
This paper compares GAIN along with several commonly used imputation methods against more standard practices such as data cropping or filling in with average values for filling in missing data. GAIN, as well as the various imputation methods described are quantified for there ability to fill in data. The study will discuss how the GAIN model can quickly provide the data necessary for analytical studies and prediction of results for future projects.
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献