Utilizing deep learning and graph mining to identify drug use on Twitter data

Published:2020-12 Issue:S11 Volume:20 Page:
ISSN:1472-6947
Container-title:BMC Medical Informatics and Decision Making
language:en
Short-container-title:BMC Med Inform Decis Mak

Author:

Tassone Joseph,Yan Peizhi,Simpson Mackenzie,Mendhe Chetan,Mago Vijay,Choudhury Salimur

Abstract

Background: The collection and examination of social media data has become a useful mechanism for studying the mental activity and behavioral tendencies of users. Through the analysis of a collected set of Twitter data, a model was developed for predicting positively referenced, drug-related tweets. From this, trends and correlations can be determined.

Methods: Social media data (tweets and their attributes) were collected and filtered using topic-related keywords, such as drug slang and use-conditions (methods of drug consumption). Preprocessing the candidate tweets resulted in a dataset of 3,696,150 rows. The predictive classification power of multiple methods was compared, including SVM, XGBoost, BERT and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets.

Results: To test predictive capability, SVM and XGBoost were employed first. These models achieved accuracies of 59.33% and 54.90%, with AUCs of 0.87 and 0.71, respectively. The values show a low predictive capability with little discrimination. In contrast, both CNN-based classifiers tested showed a significant improvement. The first was trained with 2661 manually labeled samples, while the second also included synthetically generated tweets, for a total of 12,142 samples. Their accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91, respectively. Association rule mining, used in conjunction with the CNN-based classifier, showed that keywords such as “smoke”, “cocaine”, and “marijuana” had a high likelihood of triggering a drug-positive classification.

Conclusion: Predictive analysis with a CNN is promising, whereas attribute-based models presented little predictive capability and were not suitable for analyzing text data. This research found that the commonly mentioned drugs had a level of correspondence with frequently used illicit substances, which supports the practical usefulness of this system. Lastly, the synthetically generated set increased accuracy scores and improved predictive capability.
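
The abstract does not spell out how the association rules were mined, so the following is only a minimal sketch of the general idea, not the authors' procedure. It computes the support and confidence of rules of the form "keyword → drug-positive" over classifier output. The tweets, labels, and the names `LABELED_TWEETS`, `KEYWORDS`, and `rule_stats` are all hypothetical and made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: (tweet text, classifier label), where 1 = drug-positive.
# These examples are invented and are not taken from the paper's dataset.
LABELED_TWEETS = [
    ("going to smoke after work", 1),
    ("smoke from the wildfire is awful", 0),
    ("he got caught with cocaine", 1),
    ("marijuana and a movie tonight", 1),
    ("coffee and a movie tonight", 0),
    ("time to smoke some marijuana", 1),
]

# Keywords named in the abstract; the list itself is an assumption.
KEYWORDS = ["smoke", "cocaine", "marijuana"]


def rule_stats(tweets, keywords):
    """Return {keyword: (support, confidence)} for the rule keyword -> drug-positive.

    support    = P(keyword present and label positive)
    confidence = P(label positive | keyword present)
    """
    total = len(tweets)
    with_keyword = Counter()
    with_keyword_and_positive = Counter()
    for text, label in tweets:
        tokens = set(text.lower().split())  # naive whitespace tokenization
        for kw in keywords:
            if kw in tokens:
                with_keyword[kw] += 1
                if label == 1:
                    with_keyword_and_positive[kw] += 1

    stats = {}
    for kw in keywords:
        if with_keyword[kw] == 0:
            continue  # the rule is undefined if the keyword never occurs
        support = with_keyword_and_positive[kw] / total
        confidence = with_keyword_and_positive[kw] / with_keyword[kw]
        stats[kw] = (support, confidence)
    return stats


if __name__ == "__main__":
    for kw, (support, confidence) in rule_stats(LABELED_TWEETS, KEYWORDS).items():
        print(f"{kw} -> drug-positive: support={support:.2f}, confidence={confidence:.2f}")
```

A high confidence for a keyword means that tweets containing it are usually classified as drug-positive, which is the kind of relationship the abstract reports for “smoke”, “cocaine”, and “marijuana”. A full rule-mining pass would normally also consider multi-keyword itemsets and a minimum-support threshold, both of which this sketch omits.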

Funder

Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada

Publisher

Springer Science and Business Media LLC

Subject

Health Informatics,Health Policy,Computer Science Applications

Link

http://link.springer.com/content/pdf/10.1186/s12911-020-01335-3.pdf

References (40 articles)

1. Johnson T. Sources of error in substance use prevalence surveys. Int Schol Res Not. 2014. https://doi.org/10.1155/2014/923290.

2. Sarker A, O’Connor K, Ginn R, Scotch M, Smith K, Malone D, Gonzalez G. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Saf. 2016;39(3):231–40.

3. Gittelman S, Lange V, Crawford CAG, Okoro CA, Lieb E, Dhingra SS, Trimarchi E. A new source of data for public health surveillance: Facebook likes. J Med Internet Res. 2015;17(4):98.