Big Data Analysis Using Unsupervised Machine Learning: K-means Clustering and Isolation Forest Models for Efficient Anomaly Detection and Removal in Complex Lithologies

Published: 2024-02-12
Container-title: All Days

Author:

Janjua Aneeq Nasir¹, Abdulraheem Abdulazeez¹, Tariq Zeeshan²

Affiliation:

1. King Fahd University of Petroleum & Minerals

2. King Abdullah University of Science and Technology

Abstract

Lithology identification plays a pivotal role in the characterization of subsurface formations. In recent years, the advent of big data and the need for more precise lithology identification have spurred the growing adoption of machine learning algorithms. The primary objective of this paper is to use unsupervised machine learning techniques to identify and remove anomalies in complex datasets. The Isolation Forest model is the cornerstone of our approach to anomaly detection and elimination. We first employed the K-means algorithm to create clusters, followed by an evaluation using silhouette coefficients. We then selected input data for each cluster and conducted exploratory data analysis both before and after the removal of outliers. Histograms of the average anomaly scores are presented for each cluster. For real-time anomaly detection, we used the Isolation Forest model to create an Isolation Forest anomalies map by plotting neutron porosity against bulk density. Our data analysis employed several statistical techniques, including summary statistics, histograms, and cross plots of neutron porosity against bulk density. This approach removed anomalies from the dataset, as illustrated by the histograms, in which anomalies are identifiable by their negative scores. The Isolation Forest map showed that outliers were effectively removed from the dataset, underscoring the model's ability to identify and mitigate anomalies based on their negative scores. The Isolation Forest model thus proved effective at identifying and eliminating data anomalies. Its versatility makes it well suited to detecting and removing outliers, deviations, or noise from datasets in a range of analytical scenarios. Notably, the combination of the K-means and Isolation Forest algorithms emerges as an advantageous approach, especially when dealing with large datasets and comprehensive analyses.
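The workflow described in the abstract (K-means clustering, silhouette evaluation, then Isolation Forest scoring within each cluster and removal of negatively scored points) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses synthetic two-column data as a stand-in for neutron porosity and bulk density logs, and the helper names (make_synthetic_logs, pick_k_by_silhouette, remove_anomalies_per_cluster), the candidate cluster counts, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def make_synthetic_logs(seed=0):
    """Hypothetical stand-in for well logs.

    Columns: neutron porosity (v/v), bulk density (g/cm3).
    """
    rng = np.random.default_rng(seed)
    group_a = rng.normal([0.25, 2.30], [0.02, 0.03], size=(300, 2))
    group_b = rng.normal([0.08, 2.65], [0.02, 0.03], size=(300, 2))
    # Uniformly scattered points play the role of anomalous readings.
    spikes = rng.uniform([0.0, 1.8], [0.6, 3.0], size=(20, 2))
    return np.vstack([group_a, group_b, spikes])


def pick_k_by_silhouette(X, candidates=(2, 3, 4), seed=0):
    """Fit K-means for each candidate k; return the k with the highest
    mean silhouette coefficient, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


def remove_anomalies_per_cluster(X, labels, seed=0):
    """Fit one Isolation Forest per cluster and keep only points whose
    decision_function score is non-negative (negative means anomaly)."""
    keep = np.zeros(len(X), dtype=bool)
    scores = np.empty(len(X))
    for cluster in np.unique(labels):
        idx = np.where(labels == cluster)[0]
        forest = IsolationForest(n_estimators=100, random_state=seed).fit(X[idx])
        scores[idx] = forest.decision_function(X[idx])
        keep[idx] = scores[idx] >= 0
    return keep, scores


logs = make_synthetic_logs()
# Standardize so both logs contribute comparably to K-means distances.
X = StandardScaler().fit_transform(logs)

best_k, silhouettes = pick_k_by_silhouette(X)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

keep, scores = remove_anomalies_per_cluster(X, labels)
cleaned = logs[keep]

print(f"chosen k = {best_k}")
print(f"removed {np.count_nonzero(~keep)} of {len(logs)} samples")
```

In this sketch the sign convention comes from scikit-learn's decision_function, which returns negative values for points the forest treats as outliers; that matches the abstract's description of anomalies being identified by negative scores. Histograms of scores per cluster and a cross plot of the two columns of logs, colored by keep, would reproduce the kind of diagnostics the abstract mentions.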

Publisher

IPTC

Link

https://onepetro.org/IPTCONF/proceedings-pdf/doi/10.2523/IPTC-23580-EA/3364651/iptc-23580-ea.pdf

References: 18 articles.

1. Abdulraheem, A., Sabakhy, E., Ahmed, M., Vantala, A., Raharja, P.D., Korvin, G., 2007. Estimation of permeability from wireline logs in a middle eastern carbonate reservoir using fuzzy logic. In: SPE Middle East Oil and Gas Show and Conference. OnePetro.

2. Ben-Gal, 2005. Outlier detection.

3. Boukerche, 2020. Outlier detection: Methods, models, and classification. ACM Comput. Surv.

4. Chen, 2020. A new method of lithology classification based on convolutional neural network algorithm by utilizing drilling string vibration data. Energies.

5. Laskar, M., Huang, J., Smetana, V., Stewart, C., Pouw, K., An, A., Chan, S., Liu, L., 2021. Extending Isolation Forest for Anomaly Detection in Big Data via K-Means. https://doi.org/10.48550/arXiv.2104.13190