Authors:
Drozda Paweł, Ropiak Krzysztof, Nowak Bartosz, Talun Arkadiusz, Osowski Maciej
Abstract
The main aim of this paper is to evaluate crawlers that collect job offers from websites. In particular, the research focuses on how effectively ensemble machine learning methods can check the validity of the position extracted from job ads. Moreover, to shorten the training time of the algorithms (Random Forests and XGBoost), granulation methods were tested as a way to significantly reduce the input training dataset. Both methods achieved satisfactory results, with accuracy and F1 measures exceeding 96%. In addition, granulation reduced the input dataset by more than 99%, while the results obtained were only slightly worse (accuracy lower by 1% to 5%, F1 lower by 3% to 8%). Thus, it can be concluded that the considered methods can be used in the evaluation of job web crawlers.
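The abstract reports accuracy, F1, and the percentage reduction of the training set. As a minimal sketch of how such figures are typically computed, the following uses only the Python standard library. The function names, the toy labels, and the dataset sizes are hypothetical illustrations, not the authors' code or data; the label 1 is assumed to mean "the crawler extracted the position correctly".

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def reduction_percent(original_size: int, reduced_size: int) -> float:
    """Percentage by which a training set shrinks after a reduction step."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - reduced_size) / original_size


# Hypothetical toy example: 10 job ads, 1 = extracted position is valid.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")

# Hypothetical sizes: 10,000 training objects reduced to 80.
print(f"reduction={reduction_percent(10_000, 80):.1f}%")
```

Reporting F1 alongside accuracy is useful when valid and invalid extractions are unevenly represented, since accuracy alone can look high for a classifier that favors the majority class.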
Publisher
Uniwersytet Warminsko-Mazurski