Affiliation:
1. School of EECS, Korea Aerospace University, Hanggongdaehak-ro 76-10, Deogyang-gu, Goyang-si 10540, Republic of Korea
2. Division of Computer Science and Engineering, Sahmyook University, Hwarang-ro 815, Nowon-gu, Seoul 01795, Republic of Korea
Abstract
User reviews such as SNS feeds and blog posts have been widely used to extract opinions, complaints, and requirements about a given place or product from the users' perspective. However, the collection process often returns many reviews that are irrelevant to the given search keyword, and such irrelevant reviews may distort the results of data analysis. In this paper, we discuss a method to detect irrelevant user reviews efficiently by combining various oversampling and machine learning algorithms. About 35,000 user reviews collected for 25 restaurants and 33 tourist attractions in Ulsan Metropolitan City, South Korea, were used for training, where the ratio of irrelevant reviews in the two data sets was 53.7% and 71.6%, respectively. To deal with the class imbalance in the collected reviews, the oversampling algorithms SMOTE, Borderline-SMOTE, and ADASYN were used. To build a model for detecting irrelevant reviews, RNN, LSTM, GRU, and BERT were adopted and compared, as they are known to provide high accuracy in text processing. The performance of the detection models was examined through experiments, and the results showed that the BERT model performed best, with an F1 score of 0.965.
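To illustrate the two technical ingredients named in the abstract, the following is a minimal sketch of SMOTE-style oversampling and of the F1 score, not the authors' implementation. It assumes each review has already been converted to a numeric feature vector (the abstract does not say how), and the function names `smote` and `f1_score`, the neighbour count `k`, and the toy data are hypothetical choices made for illustration.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy minority-class feature vectors (assumed, not from the paper's data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6)
score = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
```

Borderline-SMOTE and ADASYN differ mainly in which minority samples are chosen as the base for interpolation; the interpolation step itself is the same idea as above.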