Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models

Published: 2024-02-06 Issue: 1 Volume: 6 Pages: 367-384
ISSN: 2504-4990
Container title: Machine Learning and Knowledge Extraction
Language: en
Short container title: MAKE

Author:

Trad Fouad¹, Chehab Ali¹

Affiliation:

1. Electrical and Computer Engineering, American University of Beirut, Beirut 1107-2020, Lebanon

Abstract

Large Language Models (LLMs) are reshaping the landscape of Machine Learning (ML) application development. The emergence of versatile LLMs capable of undertaking a wide array of tasks has reduced the necessity for intensive human involvement in training and maintaining ML models. Despite these advancements, a pivotal question emerges: can these generalized models negate the need for task-specific models? This study addresses this question by comparing the effectiveness of LLMs in detecting phishing URLs when utilized with prompt-engineering techniques versus when fine-tuned. Notably, we explore multiple prompt-engineering strategies for phishing URL detection and apply them to two chat models, GPT-3.5-turbo and Claude 2. In this context, the maximum result achieved was an F1-score of 92.74% by using a test set of 1000 samples. Following this, we fine-tune a range of base LLMs, including GPT-2, Bloom, Baby LLaMA, and DistilGPT-2—all primarily developed for text generation—exclusively for phishing URL detection. The fine-tuning approach culminated in a peak performance, achieving an F1-score of 97.29% and an AUC of 99.56% on the same test set, thereby outperforming existing state-of-the-art methods. These results highlight that while LLMs harnessed through prompt engineering can expedite application development processes, achieving a decent performance, they are not as effective as dedicated, task-specific LLMs.
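
For readers unfamiliar with the F1-score reported in the abstract, the following is a minimal sketch of how the metric is computed for a binary phishing/legitimate classifier. The labels below are invented toy data, not the study's test set, and the function name `f1_score` and the label encoding (1 = phishing, 0 = legitimate) are assumptions made for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: phishing URLs correctly flagged as phishing.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: legitimate URLs wrongly flagged as phishing.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: phishing URLs that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(f"F1 = {f1_score(y_true, y_pred):.2%}")
```

Because F1 balances precision against recall, it penalizes both missed phishing URLs and false alarms, which is presumably why it is reported alongside AUC for this task.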

Funder

Maroun Semaan Faculty of Engineering and Architecture (MSFEA) at the American University of Beirut

Publisher

MDPI AG

Link

https://www.mdpi.com/2504-4990/6/1/18/pdf

Cited by 9 articles.

1. Enhancing privacy policy comprehension through Privacify: A user-centric approach using advanced language models;Computers & Security;2024-10

2. Performance of Recent Large Language Models for a Low-Resourced Language;2024 International Conference on Asian Language Processing (IALP);2024-08-04

3. Walkthrough phishing detection techniques;Computers and Electrical Engineering;2024-08

4. Framework for Integrating Generative AI in Developing Competencies for Accounting and Audit Professionals;Electronics;2024-07-04

5. LLM-Driven SAT Impact on Phishing Defense: A Cross-Sectional Analysis;2024 12th International Symposium on Digital Forensics and Security (ISDFS);2024-04-29