Author:
Heusinger Moritz, Raab Christoph, Schleif Frank-Michael
Abstract
In recent years, social media has become an important part of everyday life for many people. A major challenge in social media is finding posts that are interesting to the user. Many social networks, such as Twitter, address this problem with so-called hashtags. A user can label their own tweet (post) with a hashtag, while other users can search for posts containing a specified hashtag. But what about finding posts that were not labeled by their creator? We provide a way of completing hashtags for unlabeled posts using classification on a novel real-world Twitter data stream. New posts are created every second, so this setting is well suited to non-stationary data analysis. Our goal is to show how labels (hashtags) of social media posts can be predicted by stream classifiers. In particular, we employ random projection (RP) as a preprocessing step in calculating streaming models. We also provide a novel real-world data set for streaming analysis, called NSDQ, with a comprehensive data description, and we show that this data set is a real challenge for state-of-the-art stream classifiers. While RP has been widely used and evaluated in stationary data analysis scenarios, its behavior in non-stationary environments has not been well analyzed. In this paper, we provide a use case of RP on real-world streaming data, in particular on the NSDQ data set. We discuss why RP can be used in this scenario and how it can handle stream-specific situations such as concept drift. We also provide experiments with RP on streaming data, using state-of-the-art stream classifiers such as adaptive random forest, together with concept drift detectors. Additionally, we experimentally evaluate an online principal component analysis (PCA) approach in the same fashion as we do for RP. To obtain higher-dimensional synthetic streams, we use random Fourier features (RFF) in an online manner, which allows us to increase the number of dimensions of low-dimensional streams.
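To make the two preprocessing ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a sparse random projection matrix in the style of Achlioptas (2003) and random Fourier features for an RBF kernel; all function names, the `gamma` parameter, and the toy stream are hypothetical choices made for illustration. Because both mappings are drawn independently of the data, they can be fixed before the stream starts and applied to one sample at a time.

```python
import math
import random


def make_projection(d_in, d_out, seed=0):
    """Draw a sparse random projection matrix (Achlioptas style).

    Entries are +s, 0, -s with probabilities 1/6, 2/3, 1/6,
    where s = sqrt(3 / d_out). The matrix does not depend on the data.
    """
    rng = random.Random(seed)
    s = math.sqrt(3.0 / d_out)
    return [rng.choices([s, 0.0, -s], weights=[1, 4, 1], k=d_out)
            for _ in range(d_in)]


def project(x, R):
    """Map one stream sample x (length d_in) down to d_out dimensions."""
    z = [0.0] * len(R[0])
    for xi, row in zip(x, R):
        if xi != 0.0:  # skip zero features, useful for sparse text vectors
            for j, r in enumerate(row):
                z[j] += xi * r
    return z


def make_rff(d_in, d_out, gamma=1.0, seed=0):
    """Return a function lifting a sample to d_out random Fourier features.

    Approximates the RBF kernel exp(-gamma * ||x - y||^2); gamma is an
    assumed hyperparameter, not a value taken from the paper.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, sigma) for _ in range(d_in)] for _ in range(d_out)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d_out)]
    scale = math.sqrt(2.0 / d_out)

    def lift(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(W, b)]

    return lift


if __name__ == "__main__":
    rng = random.Random(42)
    R = make_projection(d_in=1000, d_out=50)   # high -> low dimensional
    lift = make_rff(d_in=3, d_out=100)         # low -> high dimensional
    for _ in range(3):                         # toy stand-in for a stream
        x_high = [rng.gauss(0.0, 1.0) for _ in range(1000)]
        x_low = [rng.gauss(0.0, 1.0) for _ in range(3)]
        print(len(project(x_high, R)), len(lift(x_low)))
```

In a full pipeline, each transformed sample would then be passed to an incremental classifier; that step is omitted here because it depends on the stream-learning library in use.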
Funder
Bayerisches Staatsministerium für Wirtschaft, Landesentwicklung und Energie
European Social Fund
Hochschule für angewandte Wissenschaften Würzburg-Schweinfurt
Publisher
Springer Science and Business Media LLC
Subject
Control and Optimization, Computer Science Applications, Modelling and Simulation, Control and Systems Engineering
Cited by
3 articles.