A New Text Classification Model Based on Contrastive Word Embedding for Detecting Cybersecurity Intelligence in Twitter

Published:2020-09-18 Issue:9 Volume:9 Page:1527
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Shin Han-Sub, Kwon Hyuk-Yoon, Ryu Seung-Jin

Abstract

Detecting cybersecurity intelligence (CSI) on social media such as Twitter is crucial because it allows security experts to respond to cyber threats in advance. In this paper, we devise a new text classification model based on deep learning to classify CSI-positive and CSI-negative tweets from a collection of tweets. For this, we propose a novel word embedding model, called contrastive word embedding, that maximizes the difference between the base embedding models. First, we define CSI-positive and CSI-negative corpora, which are used for constructing the embedding models. Here, to compensate for the imbalance of the tweet data sets, we additionally employ background knowledge for each tweet corpus: (1) the CVE data set for the CSI-positive corpus and (2) the Wikitext data set for the CSI-negative corpus. Second, we adopt deep learning models such as CNN or LSTM to extract adequate feature vectors from the embedding models and integrate the feature vectors into one classifier. To validate the effectiveness of the proposed model, we compare our method with two baseline classification models: (1) a model based on a single embedding model constructed with the CSI-positive corpus only and (2) another model constructed with the CSI-negative corpus only. The results show that the proposed model achieves high accuracy, i.e., an F1-score of 0.934 and an area under the curve (AUC) of 0.935, which improves on the baseline models by 1.76% to 6.74% in F1-score and by 1.64% to 6.98% in AUC.
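
The two-embedding design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding tables hold random toy vectors rather than vectors trained on the CVE or Wikitext corpora, mean pooling stands in for the CNN or LSTM feature extractor, the classifier weights are untrained placeholders, and every name (`toy_embedding`, `mean_pool`, `contrastive_features`, `predict_proba`) is hypothetical. It only shows the data flow: one tweet is encoded by both embedding models, and the two feature vectors are concatenated before a single classifier.

```python
import math
import random

DIM = 4  # toy embedding size (assumption, not from the paper)

def toy_embedding(vocab, seed):
    """Stand-in for an embedding model built from one corpus (random toy vectors)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def mean_pool(tokens, table):
    """Average the vectors of known tokens; a simple substitute for a CNN/LSTM encoder."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def contrastive_features(tokens, pos_table, neg_table):
    """Concatenate the features from both embedding models into one vector."""
    return mean_pool(tokens, pos_table) + mean_pool(tokens, neg_table)

def predict_proba(features, weights, bias):
    """Logistic classifier over the concatenated feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

vocab = ["exploit", "cve", "patch", "coffee", "weather", "game"]
pos_table = toy_embedding(vocab, seed=0)  # hypothetical CSI-positive embedding
neg_table = toy_embedding(vocab, seed=1)  # hypothetical CSI-negative embedding

tweet = "new exploit for cve released".split()
features = contrastive_features(tweet, pos_table, neg_table)
weights = [0.1] * (2 * DIM)  # untrained placeholder weights
prob = predict_proba(features, weights, bias=0.0)
print(len(features), round(prob, 3))
```

In a real setting the two tables would be replaced by embedding models trained on the respective corpora and the weights would be learned from labeled tweets; the concatenation step is the part that mirrors the integration of feature vectors into one classifier.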

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering,Computer Networks and Communications,Hardware and Architecture,Signal Processing,Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/9/9/1527/pdf

References: 64 articles.

1. Twitter by Numbers: Stats, Demographics & Fun Facts https://www.omnicoreagency.com/twitter-statistics/

2. Text-Based Twitter User Geolocation Prediction

3. Automatic crime prediction using events extracted from twitter posts;Wang,2012

Cited by 28 articles.

1. Correlation-driven multi-level learning for anomaly detection on multiple energy sources;Applied Soft Computing;2024-07

2. Analyzing user reactions using relevance between location information of tweets and news articles;EPJ Data Science;2024-06-26

3. A systematic review on research utilising artificial intelligence for open source intelligence (OSINT) applications;International Journal of Information Security;2024-06-05

4. RETRACTED: New ensemble learning algorithm based on classification certainty and semantic correlation;Journal of Intelligent & Fuzzy Systems;2024-04-18

5. DeepScraper: A complete and efficient tweet scraping method using authenticated multiprocessing;Data & Knowledge Engineering;2024-01