A method of multi-label text classifier at the publication level for cancer literature (Preprint)

Author:

Zhang Ying, Li Xiaoying, Liu Yi, Li Aihua, Yang Xuemei, Tang Xiaoli

Abstract

BACKGROUND

Cancer poses a serious threat to human health, and the volume of data in the cancer field is growing rapidly, along with increasing attention to interdisciplinary and cooperative research. The low-resolution, journal-level classification of published research fails to meet advanced research demands, and a single label does not adequately characterize a publication. A higher-resolution multi-label classifier is therefore needed to support cancer research.

OBJECTIVE

This paper presents a scalable multi-label classifier that classifies cancer research literature directly at the publication level and assigns appropriate content-based labels, supporting the highest-resolution classification. The model could be used to support academic statistics and to address the low-resolution problem in subject classification of cancer research that arises from the ambiguity of journal-level classification.

METHODS

We propose a new probabilistic classifier for literature classification based on a “BERT + X” model and identify the best option for “X,” namely TextRNN. First, a corpus of 50,000 records collected from DIMENSIONS was divided into a training set and a test set at a ratio of 7:3. Second, using ICRP CT, a classification scheme for cancer, we compared the performance of classifiers that combine BERT with classical deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN), TextRNN, TextCNN, and FastText, and analyzed the resulting metrics. Finally, through visualization and statistical analysis, we conclude that “BERT + TextRNN” is the best fit for publication-level multi-label classification of cancer research and of areas with similar text structure and label distribution.
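The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the label names in `LABEL_SPACE`, and the synthetic records are all hypothetical, and the abstract does not say how the split was randomized or how labels were encoded.

```python
import random

def split_corpus(records, train_parts=7, test_parts=3, seed=0):
    """Shuffle records and split them at a train_parts:test_parts ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

def multi_hot(labels, label_space):
    """Encode a set of labels as a 0/1 vector over label_space."""
    return [1 if name in labels else 0 for name in label_space]

# Hypothetical label names, for illustration only (not taken from ICRP CT).
LABEL_SPACE = ["biology", "etiology", "prevention", "treatment"]

# Synthetic stand-in for the 50,000-record corpus; each record carries two labels.
corpus = [
    (f"abstract {i}", {LABEL_SPACE[i % 4], LABEL_SPACE[(i + 1) % 4]})
    for i in range(50_000)
]

train, test = split_corpus(corpus)
print(len(train), len(test))
print(multi_hot({"biology", "treatment"}, LABEL_SPACE))
```

A multi-hot target vector of this kind is the usual input to a multi-label classifier, where each label is predicted independently rather than as one class out of many.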

RESULTS

Based on “BERT + X,” we trained a multi-label classifier that classifies literature directly at the publication level rather than categorizing from coarse to fine. After comparing the constructed models, we obtained the classifier from the optimal model, “BERT + TextRNN,” which could be applied directly in production and research, with precision (P) = 0.9142, recall (R) = 0.8560, and F1 = 0.8842. We also examined why the model is effective in the cancer field, found that articles published in this field have distinctive text structure and label distribution, and concluded through quantitative analysis that the model has the potential to generalize to other fields with similar characteristics.
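For readers unfamiliar with how P, R, and F1 are computed in a multi-label setting, the sketch below shows one common choice, micro-averaging over all (publication, label) pairs. The abstract does not state which averaging scheme was used, so micro-averaging is an assumption here, and the function names and toy vectors are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over multi-hot label vectors."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1      # label correctly assigned
            elif p:
                fp += 1      # label assigned but not in the ground truth
            elif t:
                fn += 1      # ground-truth label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy example: two publications, three candidate labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(micro_prf(y_true, y_pred))

# Harmonic mean of the reported P and R.
print(round(f1_from(0.9142, 0.8560), 3))
```

The harmonic mean of the reported precision and recall comes to about 0.884, which agrees with the reported F1 to three decimal places.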

CONCLUSIONS

This paper presents a scalable and extensible model, based on “BERT + TextRNN,” that is suitable for high-resolution subject classification of cancer literature at the publication level. The model is also applicable to other literature consisting of highly professional, systematic, and uniformly standardized long-form text. Verification of the publication-level multi-label classifier indicates that it could provide effective support for academic statistics and clinical research.

Publisher

JMIR Publications Inc.
