Facing the Clinical Trial Annotation Problem on Breast Cancer: Natural Language Processing & Machine Learning Model Selection

The clinical trial classification problem (CTCP) is a cutting-edge real-life application of biomedical informatics, especially in the domain considered in this paper, namely breast cancer. The task consists in developing models able to discriminate a patient's eligibility profile for breast cancer trials based on performance status (PS) labels. The task has gained relevance in medical research and practice in the framework of decision support systems. Besides, it has been considered a meaningful instrument for the accurate selection of participants in experimentations, resulting in no health-beha (Preprint)

Author:

Reynoso Aguirre Pablo Eliseo

Abstract

BACKGROUND

Nowadays, biomedical informatics has gained high importance through real-life applications and academic research (see [1] for a clear overview of such applications). Focusing on clinical trials (CT) for breast cancer, the National Institutes of Health (NIH) lists biomedical applications related to breast cancer clinical trials [9]. Each study's protocol in a CT has guidelines for who can or cannot participate in the study. These guidelines, called eligibility criteria (EC), describe characteristics that must be shared by all participants. They may include age, gender, medical history, and current health status. EC for treatment studies often seek patients with a particular type and stage of cancer. Facing the problems involved in these applications entails the following implications:

• The complex and non-unified variety of genres in which biomedical information is represented, including electronic health records (EHR), medical publications, and clinical trial reports (CTR).

• The difficulty of extracting, normalizing, and classifying medical entities such as drugs, diseases, and other medical-related linguistic patterns (doses, formulas, units).

All these implications are present in CTs, documents written for human use. These files frequently contain imprecise information that is not useful for medical decision support (MDS). An important task related to MDS on CTs is the automatic computation of the Performance Status (PS) of a patient given the textual content of the EC in CTs. PS is a metric that evaluates a prospective patient's stage of cancer. According to the literature [12], PS can be represented on different scales: the Eastern Cooperative Oncology Group (ECOG) PS [13], the Karnofsky PS (KPS) [14], and the Lansky PS (LPS), a particular case of KPS for pediatric oncology studies. These scales describe a patient's stage of cancer based on their daily physical-behavioral signs. The work in [10] adopts KPS as the EC classification scale for annotating breast cancer patient profiles in CTs.
That study proposes an algorithm which uses a minimum of two and a maximum of three questions to facilitate an adequate and efficient evaluation of the CT KPS score. The system model obtained good average performance for this type of application. However, their CT classifier suffers from synonymy, polysemy, and fuzziness. In [27], ExaCT, researchers provide user assistance for locating and extracting key trial characteristics (e.g., EC, sample size, drug dosage, primary outcomes) from full-text journal articles reporting on randomized controlled trials. In [28], the study presents a system working on cancer vaccine CTs, enabling rapid extraction of information about diseases, clinical approaches, and dates in order to obtain the predominant cancer types in the trials. Finally, a former work on the Clinical Trial Classification Problem (CTCP) task [30] considered multivariate regression modeling to forecast the min & max PS scores of a given CT. The findings in Table 4 suggested that both the PLS and MLP models achieved weak classification results in terms of R² values: min ∈ [0.1116, 0.1049]; max ∈ [0.0312, 0.0505]. Typical benchmark R² scores in pure science [29] are R² ∈ [0.5, 0.75]. The experimental results denote a high dependency of learning performance on the data representations, e.g., complex combinations of clinical terms such as bigrams and trigrams, and a tendency toward better generalization with linear models than with non-linear approaches.

OBJECTIVE

To explore different Natural Language Processing & Machine Learning techniques, such as Class Distribution Balancing, Trial Text Pre-Processing (Normalization), Feature Extraction (Weighting), Feature Selection (Mapping), ML Model Tuning, and an additional consideration, Sampling Tuning, in order to induce KPS_min & KPS_max models from Breast Cancer Randomized Trials data that suggest annotations for new breast cancer trials, as a complementary decision support tool for medical specialists.

METHODS

* Class Distribution Balancing

As seen in Figures 1 & 2, both KPS range limits of the CTs have a highly imbalanced distribution, particularly for the max variable; therefore we proceed with a sampling approach. Since the number of samples in the minority classes is very low, oversampling seems to be an appropriate framework to tackle the problem. For this task, the implementation considers the most generic technique, RandomOverSampler, oversampling minority class occurrences up to the number of occurrences of the majority class (with replacement, without adding noise to the sample copies) for each classification task, KPS_min & KPS_max. In this stage only MNB is considered, to observe the classification outcome based on statistical descriptors. As can be seen in Tables 7 & 8, no sampling on the imbalanced data may seem to achieve better classification results. However, the Youden-J statistic reflects how well the generalization of the models differentiates among the n classes predicted for either KPS_min or KPS_max. The higher the Youden-J statistic, the better the model generalizes to predict all the different classes of the KPS range. Therefore, the simple RandomOverSampler assumption of sampling up to the number of samples in the majority class seems to heal the imbalance problem, and helps the model inference escape from overfitting the majority class in the variable distribution.

* Feature Extraction (Weighting)

An important part of model inference is the extraction of features for training the models. Moreover, tasks that involve natural language text require a text embedding (Word2Vec, Sentence2Vec, Doc2Vec) as numerical matrices to be valid input data for classical Machine Learning algorithms. Deep learning approaches consider methods such as Keras Embeddings and BERT that automatically calculate text embeddings using initial weights in the input layer of the networks.
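The up-to-majority oversampling described above can be sketched in a few lines; this is a minimal standard-library stand-in for imblearn's RandomOverSampler (the function name and the toy KPS labels are illustrative, not from the study):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Oversample every minority class up to the majority-class count,
    drawing copies with replacement and adding no noise to the copies."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())            # majority-class count
    X_res, y_res = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):          # add copies until balanced
            i = rng.choice(idx)
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res

X = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
y = [80, 80, 80, 90, 100]                    # imbalanced toy KPS labels
X_res, y_res = random_oversample(X, y)       # every class now has 3 samples
```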
In this experimentation phase we consider a Doc2Vec-style embedding to find relationships among the CT XML documents based on document word similarities. Doc2Vec can be implemented with different weighting schemes: CountVectorizer and TfidfVectorizer. CountVectorizer records the frequency of a word in a given document of the corpus, while TfidfVectorizer considers a special weight combining the frequency of a word in a given document times a normalizing factor of how common the term is across all documents in the corpus. The Doc2Vec columns, the Bag of Words (BOW), are represented by the universe of words in all documents of the corpus. The BOW may contain monograms, bigrams, trigrams, n-grams, or combinations of them if needed. The Doc2Vec rows represent the documents in the corpus. After the Doc2Vec textual model is applied to the CT documents, the numerical representation of the text is known as the document-term matrix (DTM) (see Figure 7). In this stage only MNB is considered, to observe the classification outcome based on statistical descriptors.

* Trial Text Pre-Processing

After experimenting on feature extraction weighting to find a useful numerical representation of features, a text pre-processing stage is considered to boost the feature extraction, approaching different techniques of text normalization. All approaches involve tokenization and subsequent NLP pre-processing methods such as StopWords Removal, a Stemming Algorithm, and English Lemmatization. In this stage only MNB is considered, to observe the classification outcome based on statistical descriptors on Count feature extraction for both KPS_min and KPS_max. The different pre-processing approaches for normalizing text before extracting numerical features with Multinomial Naive Bayes (MNB) suggested that, for that specific algorithm, text normalization yields better classification results when StopWords Removal and Stemming chunking are performed before extracting features by Count weighting.
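What the two weighting schemes compute can be illustrated on a toy corpus (the documents are invented; the TF-IDF below is the textbook tf·log(N/df) variant, whereas sklearn's TfidfVectorizer additionally smooths the idf and L2-normalizes each row):

```python
import math
from collections import Counter

corpus = [
    "patients with breast cancer",
    "breast cancer trial eligibility",
    "trial eligibility criteria",
]

# Bag of Words: the universe of (here, monogram) terms over the corpus.
vocab = sorted({w for doc in corpus for w in doc.split()})

# Count weighting: DTM rows are documents, columns are vocab terms.
counts = [[Counter(doc.split())[t] for t in vocab] for doc in corpus]

# TF-IDF weighting: term frequency scaled down by how common the term
# is across the whole corpus (df = document frequency).
N = len(corpus)
df = {t: sum(1 for doc in corpus if t in doc.split()) for t in vocab}
tfidf = [[row[j] * math.log(N / df[t]) for j, t in enumerate(vocab)]
         for row in counts]
```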
The experimentation was extended to implement all the different text normalization approaches, both Count & TF-IDF feature weighting, and different n-gram combinations for other ML supervised learning algorithms (with default parameters): Multinomial Logistic Regression (MLR), Support Vector Machines (SVM), and a Multilayer Perceptron Neural Network (MLP). The extended experimental results are shown in Tables 13 and 14.

* Feature Selection (Mapping)

In terms of data projections, the SVD algorithm was considered to project the existing numerical features into a more separable space. SVD has proven useful for experiments such as this one, with sparse data (of non-numerical nature, i.e., text vectorizations) and curse-of-dimensionality issues (e.g., #features > #samples). In this experiment, the feature mapping considers different numbers of meta-features for the data projections [100, 150, 200, 250, 300] on the results obtained for every text pre-processing, feature weighting, n-gram setting, and algorithm from the previous stage. After analyzing the classification results in which the n-gram features are projected into a more compact dimensional space, we obtained the results for the best SVD configuration for every algorithm's prediction of KPS_min and KPS_max. Comparing the Trial Text Pre-Processing results of Tables 13 & 14 with the Feature Selection (Mapping) results of Tables 15 & 16, respectively, we observed that the statistical metrics did not improve; therefore the SVD feature projections do not seem to be an efficient approach to boost the classification performance metrics (Accuracy, F1-Score, Youden-J Statistic).

* Model Tuning

In the experimentation related to model hyper-parameter tuning, we consider a RandomizedSearchCV approach from a model selection framework to explore different combinations of parameter values in order to find settings that optimize the classification performance metrics from the former stages.
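The SVD projection step can be sketched on a toy DTM; here k = 2 stands in for the 100-300 meta-features swept in the study, and the matrix values are invented:

```python
import numpy as np

# Toy document-term matrix: 4 documents x 6 terms (sparse counts).
dtm = np.array([
    [2, 1, 0, 0, 1, 0],
    [1, 2, 0, 0, 0, 1],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 1, 2, 1, 0],
], dtype=float)

k = 2  # number of meta-features to keep

# A truncated SVD keeps only the k largest singular triplets; each
# document row is projected onto k "meta-feature" axes, which is what
# sklearn's TruncatedSVD computes for sparse text matrices.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
projected = U[:, :k] * s[:k]          # shape: (n_docs, k)
```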
** On Multinomial Naive Bayes, the hyper-parameters and settings considered for testing are: alpha in [0.005, 5.000], class_prior = None, fit_prior = True.

** On Multinomial Logistic Regression, the hyper-parameters and settings considered for testing are: penalty = L2, C in [0.005, 6.000], tol in [0.0001, 0.2000], dual = False, solver in [lbfgs, sag, saga], multi_class in [ovr, multinomial, auto].

** On Support Vector Machines, the hyper-parameters and settings considered for testing are: penalty = L2, C in [0.005, 10.000], tol in [0.0001, 0.2000], dual in [True, False], max_iter in [1, 10], multi_class = ovr, random_state = 0, loss = squared_hinge.

** On the Multilayer Perceptron, the hyper-parameters and settings considered for testing are: activation in [identity, logistic, tanh, relu], hidden_layer_sizes in [(5, 1), (10, 1), (15, 1), (20, 1), (25, 1), (50, 1), (100, 1), (200, 1)], alpha in [0.0001, 0.2000], tol in [0.0001, 0.2000], learning_rate in [constant, invscaling, adaptive], solver = lbfgs, max_iter in [10, 25, 50, 100].

After trying different combinations of model hyper-parameters along with the generic (up-to-majority) class oversampling, the different text pre-processing, feature extraction, n-gram representation, and feature selection (mapping) options, the following results were obtained.

* Additional Considerations: Sampling Tuning

After all the experimentation performed in the previous stages, we made additional considerations to maximize the accuracy results and learning generalization by adjusting the sampling framework for both KPS_min & KPS_max. The class distribution balancing considered oversampling of the minority classes at different percentage ranges [5% - 25%] relative to the majority class [31]. This implementation only considered monogram (1,1) features in order to avoid the Curse of Dimensionality, since the sampling tuning considered between 4-6 times fewer samples than the sampling strategy of Section 4.1.
In the results found, we can observe an improvement in the KPS_max classification performance metrics. However, KPS_min seems to generalize better on n-gram [(1,2), (1,3)] feature representations than on monogram representations.
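The randomized hyper-parameter search used in the Model Tuning stage boils down to drawing parameter values from the stated ranges and keeping the best-scoring draw. A minimal sketch with a dummy scorer follows (in the study the objective would be the cross-validated accuracy of the tuned classifier; the scorer here is a placeholder so the loop is runnable):

```python
import random

def score(alpha):
    """Hypothetical stand-in for cross-validated accuracy of an
    MNB model trained with the given alpha."""
    return 1.0 / (1.0 + abs(alpha - 0.5))

rng = random.Random(0)
best_alpha, best_score = None, -1.0
for _ in range(20):                       # n_iter random draws
    alpha = rng.uniform(0.005, 5.000)     # MNB alpha range from the text
    s = score(alpha)
    if s > best_score:
        best_alpha, best_score = alpha, s
```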

RESULTS

After all the experimentation performed in the previous stages, the best generalization found for the models (MNB, MLR, SVM, MLP) on the KPS_min & KPS_max annotation tasks is:

* Final KPS_min Models

Algorithm  Pre Processing  Ngram  Precision  Recall  Accuracy  F1-Score  Youden-J (Informedness)
MNB        SW              (1,3)  0.8572     0.9070  0.9070    0.8754    0.8141
MLR        SW/Stem         (1,3)  0.8179     0.8889  0.8889    0.8501    0.7778
SVM        SW              (1,3)  0.8242     0.8930  0.8930    0.8552    0.7861
MLP        Lema            (1,3)  0.8553     0.9048  0.9048    0.8736    0.8097

* Final KPS_max Models

Algorithm  Pre Processing  Ngram  Precision  Recall  Accuracy  F1-Score  Youden-J (Informedness)
MNB        SW/Stem         (1,1)  0.9299     0.9215  0.9215    0.9232    0.9114
MLR        SW/Stem         (1,1)  0.9172     0.8950  0.8930    0.8950    0.9312
SVM        SW/Stem         (1,1)  0.9260     0.8949  0.8966    0.8949    0.9341
MLP        SW              (1,1)  0.9476     0.9374  0.9374    0.9387    0.9525
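The Youden-J statistic (informedness) reported above is, per class, sensitivity + specificity − 1; a minimal multi-class sketch follows (macro-averaged one-vs-rest, which is an assumption, since the averaging scheme is not stated in the text):

```python
def youden_j(y_true, y_pred):
    """Macro-averaged Youden-J: mean over classes of
    sensitivity + specificity - 1, computed one-vs-rest."""
    classes = sorted(set(y_true))
    js = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = sum(t != c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        js.append(sens + spec - 1.0)
    return sum(js) / len(js)
```

A perfect classifier scores 1.0, while always predicting the majority class scores 0.0, which is why the metric exposes overfitting to the majority class where plain accuracy does not.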

CONCLUSIONS

After analyzing the final learning performance results, we can observe the following key points:

• Both KPS_min & KPS_max generalize better when Count Vectorization (frequency weighting) is considered to build the Document Term Matrices.

• Both KPS_min & KPS_max generalize better with StopWords Removal text pre-processing.

• The KPS_min annotation task has better generalization performance when the minority classes are oversampled up to 100% of the majority class to heal the data imbalance, and combinations of n-gram (single, two, and three word) feature frequencies are used as feature extraction.

• The KPS_max annotation task has better generalization performance when the minority classes are oversampled up to [5% - 25%] of the majority class to heal the data imbalance, and monogram (single word) feature frequencies are used as feature extraction.

• On the KPS_max annotation task, the best learning performance found is:
1. Class Imbalance: minority classes oversampled to 15% of the majority class.
2. Text Pre-Processing: StopWords Removal.
3. Feature Extraction: CountVectorizer (1,1) monograms.
4. Model: Multilayer Perceptron.
5. Settings: MLPClassifier(activation='relu', hidden_layer_sizes=(100,1), alpha=0.0001, tol=0.0001, learning_rate='constant', solver='adam', max_iter=200)
6. Accuracy: 0.9374, F1-Score: 0.9387, Youden-J (Informedness): 0.9525

• On the KPS_min annotation task, the best learning performance found is:
1. Class Imbalance: minority classes oversampled to 100% of the majority class.
2. Text Pre-Processing: StopWords Removal.
3. Feature Extraction: CountVectorizer (1,3) monograms, bigrams, trigrams.
4. Model: Multinomial Naive Bayes.
5. Settings: MultinomialNB(alpha=0.0000000001, class_prior=None, fit_prior=True)
6. Accuracy: 0.9070, F1-Score: 0.8754, Youden-J (Informedness): 0.8141

• The best decision support models for the annotation of trials found after all the experimentation in the different stages seem to be: MLPClassifier for KPS_max and MultinomialNB for KPS_min, achieving multi-class accuracy scores of 0.9374 & 0.9070 respectively.
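Assuming the settings listed above map onto the scikit-learn estimators they name, the winning KPS_min configuration could be reconstructed roughly as follows (the three placeholder documents and labels are invented, since the trial corpus is not reproduced here):

```python
# Hypothetical reconstruction of the best KPS_min pipeline: stopword
# removal + (1,3) n-gram count features feeding a near-zero-alpha
# Multinomial Naive Bayes, per the settings listed in the conclusions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "karnofsky performance status 80",   # placeholder EC snippets
    "karnofsky performance status 60",
    "ecog performance status grade one",
]
labels = [80, 60, 90]                    # placeholder KPS_min labels

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), stop_words="english"),
    MultinomialNB(alpha=1e-10, class_prior=None, fit_prior=True),
)
model.fit(docs, labels)
pred = model.predict(["karnofsky performance status 80"])
```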

CLINICALTRIAL

Country     Cancer  Breast Cancer
U.S. Only   31268   4316
Non-U.S.    29544   3791
Total       60812   8107

Performance Status  #Samples  #Features
KPS                 3767      15296
ECOG                4023      15296

Source: clinicaltrials.gov

Publisher

JMIR Publications Inc.
