A method of multi-label text classifier at the publication level for cancer literature (Preprint)

Published: 2022-12-07

Author:

Zhang Ying, Li Xiaoying, Liu Yi, Li Aihua, Yang Xuemei, Tang Xiaoli

Abstract

BACKGROUND

Cancer poses a major threat to human health, and the volume of data in the cancer field is growing rapidly alongside increasing attention to interdisciplinary and cooperative research. Classifying reported research at the journal level is too coarse to meet advanced research demands, and a single label does not adequately characterize a publication. There is thus a need for a higher-resolution multi-label classifier to support cancer research.

OBJECTIVE

This paper presents a scalable multi-label classifier that classifies cancer research literature directly at the publication level and assigns appropriate content-based labels, supporting the highest-resolution classification. The model could be used to support academic statistics and to address the low resolution of subject classification in cancer research caused by the ambiguity of journal-level classification.

METHODS

We propose a probabilistic classifier for literature classification based on a “BERT + X” model and identify TextRNN as the best option for “X”. First, a corpus of 50,000 records collected from DIMENSIONS was divided into a training set and a test set at a ratio of 7:3. Second, using ICRP CT, a classification scheme for cancer, we compared the performance of classifiers formed by combining BERT with classical deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN), TextRNN, TextCNN, and FastText, followed by metrics analysis. Finally, by means of visualization and statistical analysis, we conclude that “BERT + TextRNN” is the best fit for publication-level multi-label classification of cancer research and of areas with similar text structure characteristics and label distribution features.
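
The data preparation described above can be sketched as follows. This is a minimal illustration and not the authors' code: the label names, the `split_7_3` and `multi_hot` helpers, the fixed random seed, and the toy corpus are all assumptions introduced here. The abstract itself states only the corpus size and the 7:3 ratio.

```python
import random

# Hypothetical label vocabulary; the actual ICRP CT label set is not given in the abstract.
LABELS = ["breast", "lung", "colorectal", "leukemia"]


def split_7_3(records, seed=0):
    """Shuffle the records and split them into training and test sets at a 7:3 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def multi_hot(labels, vocabulary=LABELS):
    """Encode a set of labels as a 0/1 vector with one position per vocabulary entry."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocabulary]


# Toy corpus standing in for the 50,000 records: (text, labels) pairs.
corpus = [(f"abstract {i}", [LABELS[i % 4], LABELS[(i + 1) % 4]]) for i in range(50)]
train, test = split_7_3(corpus)

print(len(train), len(test))           # 35 15
print(multi_hot(["lung", "leukemia"]))  # [0, 1, 0, 1]
```

A multi-hot target vector is a common way to let a single publication carry several labels at once, which is what distinguishes the multi-label setting from single-label classification.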

RESULTS

Based on “BERT + X”, we trained a multi-label classifier that classifies literature directly at the publication level rather than categorizing from coarse to fine. After comparing the constructed models, we obtained the classifier from the optimal model, “BERT + TextRNN”, which could be directly applied in production and research, with precision (P) = 0.9142, recall (R) = 0.8560, and F1 = 0.8842. Moreover, we discussed why the model is effective in the cancer field, found that articles published in this field have distinctive characteristics in text structure and label distribution, and concluded through quantitative analysis that the model has the potential to be generalized to other fields with similar characteristics.
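
For readers unfamiliar with the reported metrics, the sketch below shows one way precision, recall, and F1 can be computed for multi-label output. The micro-averaging scheme, the 0.5 decision threshold, the `binarize` and `micro_prf` helpers, and the toy probabilities are assumptions made for illustration; the abstract does not state which averaging scheme or threshold was used.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 decisions (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (document, label) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: three documents, four labels each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
probs = [[0.9, 0.2, 0.4, 0.1], [0.3, 0.8, 0.6, 0.2], [0.7, 0.9, 0.1, 0.8]]
y_pred = binarize(probs)

precision, recall, f1 = micro_prf(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Applying the same harmonic-mean formula to the reported P = 0.9142 and R = 0.8560 gives approximately 0.884, in line with the reported F1.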

CONCLUSIONS

This paper presents a scalable and extensible model, based on “BERT + TextRNN”, that is suitable for high-resolution subject classification of the cancer literature at the publication level. The model is also applicable to other literature consisting of highly professional, systematic, and uniformly standardized long-form text. Verification of the multi-label classifier at the publication level indicates that it could provide effective support for academic statistics and clinical research.

Publisher

JMIR Publications Inc.
