An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data

Published: 2022-01-14 Issue: 1 Volume: 7 Page: 11
ISSN: 2306-5729
Container-title: Data
Language: en
Short-container-title: Data

Author:

Al-Bana Mohamed Reda, Farhan Marwa Salah, Othman Nermin Abdelhakim

Abstract

Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns in transactional databases; these patterns are used in prediction, association rules, classification, and similar tasks. Apriori is an elementary, iterative FIM algorithm for finding frequent itemsets. It scans the dataset multiple times to generate large frequent itemsets of different cardinalities, so its performance degrades as the data grows. Eclat is a scalable version of the Apriori algorithm that uses a vertical layout. The vertical layout has several advantages: it avoids repeated dataset scans and carries the information needed to compute the support of each itemset. In a vertical layout, the support of an itemset is obtained by intersecting sets of transaction ids (tidsets, or tids), and irrelevant itemsets are pruned. However, when the tidsets become too large for memory, the algorithm's efficiency suffers. In this paper, we introduce SHFIM (Spark-based hybrid frequent itemset mining), a three-phase algorithm that uses both horizontal and vertical layouts and relies on diffsets instead of tidsets, keeping track of the differences between transaction ids rather than the intersections. Moreover, some improvements are introduced to decrease the number of candidate itemsets. SHFIM is implemented and tested on the Spark framework, which uses the RDD (resilient distributed dataset) concept and in-memory processing to address the problems of the MapReduce framework. We compared the performance of SHFIM with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results showed that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
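
The difference between tidsets and diffsets mentioned in the abstract can be illustrated with a small sketch. This is not the SHFIM implementation and it does not use Spark; it is plain Python on a made-up five-transaction dataset, and the function names (`to_vertical`, `support_by_tidset`, `support_by_diffset`, `extend_diffset`) are hypothetical. It assumes the standard diffset definitions used by dEclat: for single items x and y, d(xy) = t(x) - t(y) and support(xy) = support(x) - |d(xy)|; for a shared prefix P, d(Pxy) = d(Py) - d(Px) and support(Pxy) = support(Px) - |d(Pxy)|.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert a horizontal layout (tid -> items) into a vertical one (item -> tidset)."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def support_by_tidset(vertical, x, y):
    """Support of {x, y} as the size of the tidset intersection t(x) & t(y)."""
    return len(vertical[x] & vertical[y])


def support_by_diffset(vertical, x, y):
    """Support of {x, y} from the diffset d(xy) = t(x) - t(y).

    Returns (diffset, support), where support = support(x) - |d(xy)|.
    """
    diffset = vertical[x] - vertical[y]
    return diffset, len(vertical[x]) - len(diffset)


def extend_diffset(diff_px, sup_px, diff_py):
    """Combine two itemsets Px and Py that share the prefix P.

    d(Pxy) = d(Py) - d(Px) and support(Pxy) = support(Px) - |d(Pxy)|.
    """
    diffset = diff_py - diff_px
    return diffset, sup_px - len(diffset)


# Toy horizontal dataset (an assumption for illustration only).
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"b", "c"},
    5: {"a", "b", "c"},
}

vertical = to_vertical(transactions)

# 2-itemsets: both layouts give the same support.
diff_ab, sup_ab = support_by_diffset(vertical, "a", "b")
diff_ac, sup_ac = support_by_diffset(vertical, "a", "c")

# 3-itemset {a, b, c} built from {a, b} and {a, c}, which share the prefix "a".
diff_abc, sup_abc = extend_diffset(diff_ab, sup_ab, diff_ac)
```

In this toy example each diffset holds a single transaction id, while the corresponding tidset intersections hold three, which is the kind of memory saving the diffset representation aims at on dense data.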

Publisher

MDPI AG

Subject

Information Systems and Management, Computer Science Applications, Information Systems

Link

https://www.mdpi.com/2306-5729/7/1/11/pdf

References: 44 articles.

1. Data Mining Concepts and Techniques, 550 https://www.researchgate.net/publication/235902451_Data_Mining_Concept_and_Techniques

2. Frequent Itemsets Mining for Big Data: A Comparative Analysis

3. Big Data Tutorial|All You Need to Know about Big Data|Edureka https://www.edureka.co/blog/big-data-tutorial

4. A survey of open source tools for machine learning with big data in the Hadoop ecosystem

Cited by 9 articles.

1. GMiner++: Boosting GPU-based frequent itemset mining by reducing redundant computations;Expert Systems with Applications;2024-09

2. Dynamic Adaptive Mechanism Design and Implementation in VSS for Large-Scale Unified Log Data Collection;International Journal of Information Security and Privacy;2024-08-09

3. A Model for Enhancing Unstructured Big Data Warehouse Execution Time;Big Data and Cognitive Computing;2024-02-06

4. Efficient approach of high average utility pattern mining with indexed list-based structure in dynamic environments;Information Sciences;2024-02

5. Discovery of interesting frequent item sets in an uncertain database using ant colony optimization;International Journal of Computers and Applications;2023-10-09