Affiliation:
1. Universiti Teknologi Malaysia, Malaysia & Hradec Kralove University, Czech Republic
2. Universiti Teknologi Malaysia, Malaysia
3. Hradec Kralove University, Czech Republic
Abstract
The past decade has witnessed the rapid development of natural language processing and machine learning in the phishing detection domain. However, research on word embedding and deep learning for malicious URL classification remains limited. To address this gap, this chapter examines the application of word embedding and deep learning to extracting features from website URLs. Several word embedding techniques, namely Keras, Word2Vec, GloVe, and FastText, were used to learn feature representations of webpage URLs. The resulting feature vectors were fed into a deep learning model based on CNN-BiGRU for feature extraction and classification. Numerous experiments were conducted on two different datasets, and various metrics were used to evaluate the performance of the phishing detection model. The findings indicate that, when combined with deep learning, Keras outperformed the other word embedding methods and achieved the best results across all evaluation metrics on both datasets.
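
As a rough illustration of the input an embedding layer typically consumes, the sketch below turns URLs into fixed-length integer sequences. It is not the chapter's preprocessing: the word-level split on non-alphanumeric characters, the reserved padding and unknown indices, the sequence length, and the helper names `tokenize_url`, `build_vocab`, and `encode` are all assumptions made for this example.

```python
import re

# Reserved indices (an assumed convention, not taken from the chapter).
PAD, UNK = 0, 1


def tokenize_url(url):
    """Split a URL into lowercase tokens on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def build_vocab(urls):
    """Assign each distinct token an integer index, starting after PAD and UNK."""
    vocab = {}
    for url in urls:
        for tok in tokenize_url(url):
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode(url, vocab, max_len=8):
    """Map a URL to a fixed-length list of indices, truncating or padding with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize_url(url)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical training URL used only to build a small vocabulary.
vocab = build_vocab(["http://example.com/login"])
print(encode("http://example.com/verify", vocab, max_len=6))
```

Sequences of this form could then be passed to a trainable embedding layer or looked up in pretrained Word2Vec, GloVe, or FastText vectors before reaching a CNN-BiGRU classifier. Tokens unseen at vocabulary-building time fall back to the unknown index.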