Affiliation:
1. Queen Mary University of London, London, United Kingdom
Abstract
The unmoderated nature of social media enables the diffusion of hoaxes, which in turn jeopardises the credibility of information gathered from social media platforms. Existing research on automated hoax detection has been limited to relatively small datasets, owing to the difficulty of obtaining labelled data. This, in turn, has limited research on early detection of hoaxes and on other factors, such as the effect of training data size or the use of sliding windows. To mitigate this problem, we introduce a semi-automated method that leverages the Wikidata knowledge base to build large-scale datasets for veracity classification, focusing on celebrity death reports. This enables us to create a dataset with 4,007 reports including over 13M tweets, 15% of which are fake. Experiments using class-specific representations of word embeddings show that we can achieve F1 scores nearing 72% within 10 minutes of the first tweet being posted when we expand the training data using our semi-automated method. Our dataset represents a realistic scenario with a real distribution of true, commemorative, and false stories, and we release it for further use as a benchmark in future research.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications