BACKGROUND
Given the threat that cancer poses to human health, the volume of data in the cancer field is growing rapidly, and interdisciplinary and cooperative research is receiving increasing attention. Low-resolution classification of reported research at the journal level fails to meet the demands of advanced research, and a single label does not adequately characterize a publication. There is thus a need for a higher-resolution multi-label classifier to support cancer research.
OBJECTIVE
This paper presents a scalable multi-label classifier that classifies cancer research literature directly at the publication level and assigns appropriate content-based labels, in order to support classification at the highest resolution. The model could be used to support academic statistics and to address the low-resolution subject classification of cancer research caused by the ambiguity of journal-level classifiers.
METHODS
We propose a new and effective probabilistic classifier for literature classification by introducing the “BERT + X” model and identifying the best option for “X,” namely, TextRNN. First, a corpus of 50,000 records collected from DIMENSIONS was divided into a training set and a test set at a ratio of 7:3. Second, using ICRP CT, a classification scheme for cancer, we compared the performance of classifiers that combine BERT with classical deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN), TextRNN, TextCNN, and FastText, and then analyzed the resulting metrics. Finally, through visualization and statistical analysis, we conclude that “BERT + TextRNN” is the best fit for a publication-level multi-label classifier of cancer research and of areas with similar text structure and label distribution.
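As a rough illustration of the data preparation and the probabilistic multi-label decision described above, the following Python sketch shuffles a corpus, splits it 7:3, and converts per-label scores into a set of labels. The function names, the fixed random seed, the independent sigmoid outputs, and the 0.5 threshold are assumptions made for illustration; the abstract does not specify them, and the BERT + TextRNN encoder that would produce the scores is not reproduced here.

```python
import math
import random


def split_corpus(records, train_parts=7, test_parts=3, seed=42):
    """Shuffle records and split them train:test (7:3 by default).

    The seed is an illustrative assumption; integer arithmetic keeps
    the cut point exact.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(scores, label_names, threshold=0.5):
    """Assign every label whose probability reaches the threshold.

    Each label is decided independently, so a publication can receive
    zero, one, or several labels. The threshold is an assumed value.
    """
    return {
        name
        for name, score in zip(label_names, scores)
        if sigmoid(score) >= threshold
    }


# Hypothetical usage: placeholder record IDs stand in for the 50,000 records.
train_set, test_set = split_corpus(range(50_000))
labels = predict_labels([2.0, -1.0, 0.3], ["label_a", "label_b", "label_c"])
```

Deciding each label independently is what distinguishes this multi-label setup from a single-label softmax classifier, in which exactly one class would be chosen per publication.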
RESULTS
Based on the “BERT + X” model, we trained a multi-label classifier that classifies literature directly at the publication level, rather than categorizing from coarse to fine. After comparing the constructed models, we obtained the classifier from the optimal model, “BERT + TextRNN,” which could be applied directly in production and research, with precision P = 0.9142, recall R = 0.8560, and F1 = 0.8842. Moreover, we examined why the model is effective in the cancer field, found that articles published in this field have distinctive characteristics in text structure and label distribution, and concluded through quantitative analysis that the model has the potential to generalize to other fields with similar characteristics.
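To make the reported metrics concrete, the sketch below computes precision, recall, and F1 for multi-label predictions represented as sets of labels. Micro-averaging over all label assignments is an assumption here, since the abstract does not state which averaging scheme was used, and the function names are hypothetical.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over label sets.

    Counts are pooled across all publications before the ratios are
    taken (an assumed averaging scheme, not confirmed by the source).
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly assigned
        fp += len(pred - truth)   # labels assigned but not true
        fn += len(truth - pred)   # true labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Harmonic mean of the reported P and R.
reported_f1 = f1_from_pr(0.9142, 0.8560)
```

With the reported P = 0.9142 and R = 0.8560, the harmonic mean is approximately 0.884, which agrees with the reported F1 = 0.8842 to three decimal places; the small remaining difference could come from rounding or from the averaging scheme.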
CONCLUSIONS
This paper presents a scalable and extensible model, based on “BERT + TextRNN,” that is suitable for high-resolution subject classification of the cancer literature at the publication level. The model is also applicable to other literature consisting of highly specialized, systematic, and standardized long-form text. Verification of the publication-level multi-label classifier indicates that it could provide effective support for academic statistics and clinical research.