Abstract
Automated field-of-research classification for scientific papers remains challenging, even with modern tools such as large language models. As part of a shared task tackling this problem, this paper presents our contribution SLAMFORC, an approach to single-label classification using multi-modal data. We combine the metadata of papers with their full text and, where available, images in a pipeline that predicts their field of research through ensemble voting over traditional classifiers and large language models. We evaluated our approach on the shared task dataset and achieved the highest scores on two of the four competition metrics and the second-highest scores on the other two.
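To illustrate the ensemble-voting idea mentioned in the abstract, the following is a minimal sketch of hard (majority) voting over several text classifiers, assuming scikit-learn. The toy texts, labels, and choice of base models are placeholders for illustration only; the actual SLAMFORC pipeline, its features, and its LLM components are described in the paper itself.

```python
# Hedged sketch: majority voting over simple classifiers on TF-IDF features.
# The models and data below are illustrative stand-ins, not the paper's setup.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-ins for paper titles/abstracts and their field-of-research labels.
texts = [
    "graph neural networks for molecular property prediction",
    "transformer models for machine translation",
    "deep learning approaches to protein folding",
    "low-resource neural machine translation",
]
labels = ["Chemistry", "Computational Linguistics", "Biology", "Computational Linguistics"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Each base classifier casts a vote; the majority label becomes the prediction.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("svm", LinearSVC()),
    ],
    voting="hard",
)
ensemble.fit(X, labels)
print(ensemble.predict(vectorizer.transform(["attention-based neural translation"])))
```

In a multi-modal setting such as the one described here, additional voters (e.g., classifiers over metadata fields, image features, or prompted large language models) could contribute votes in the same way, but those components are assumptions of this sketch rather than a reproduction of the authors' system.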
Publisher
Springer Nature Switzerland