Affiliation:
1. SRM Institute of Science and Technology
Abstract
Scientific data available on the internet is rarely labelled. Most popular research paper repositories contain papers without any annotation for grouping them. Classification of text at the level of words, sentences and even paragraphs has become a key resource for many industries that want their computers to understand human language, the next stage in Artificial Intelligence. Using ideas from Computational Linguistics, some industrial applications have streamlined their processes to interpret language data effectively and efficiently. Continuing this trend, in this paper we aim to cluster scientific research papers into topic-based groups as efficiently as possible. Using multiple algorithms that have revolutionized the industry in recent years, we process over 800,000 uploaded scientific research articles spanning 200+ domains to predict the domain of each article. We use clustering techniques such as the K-Means algorithm to derive the topics for these papers with an accuracy of nearly 80%. We also use BERT to create topic clusters that generate topics based on frequently occurring contexts within the text. Beyond BERT, we use algorithms derived from it that tackle specific, niche issues that BERT does not account for. We also fine-tune the parameters of these algorithms to generate over 50 stronger topics that more accurately describe scientific articles.
Publisher
Trans Tech Publications Ltd