BACKGROUND
Breast cancer is the most prevalent cancer among women of every racial and ethnic group in the United States; however, significant disparities in outcomes persist across these groups. To date, few studies have used clinical notes to investigate treatment outcomes for women from underrepresented populations.
OBJECTIVE
This study aims to develop natural language processing (NLP) algorithms that extract breast cancer treatment outcomes for women from underrepresented populations from clinical notes, in order to support individualized symptom management and nursing care planning.
METHODS
We used 1,000 clinical notes obtained from the electronic health records of a tertiary academic hospital. We applied three vectorization approaches (TF-IDF, Word2Vec, and Doc2Vec) and compared their performance when paired with three classification models: Support Vector Classification, K-Nearest Neighbors, and Random Forest.
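As a minimal sketch of this kind of comparison, the following pairs one vectorizer (TF-IDF) with the three classifier families and scores each pair by cross-validated AUC. It assumes scikit-learn; the toy notes, labels, hyperparameters, and the helper name `compare_classifiers` are hypothetical and invented for illustration, and they are not the study's data or configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy notes (1 = outcome mentioned, 0 = not); NOT study data.
notes = [
    "patient reports fatigue and nausea after chemotherapy",
    "severe nausea and vomiting following second cycle",
    "fatigue persists with mild neuropathy in both hands",
    "complains of nausea and loss of appetite",
    "neuropathy and fatigue limiting daily activity",
    "ongoing vomiting and fatigue since infusion",
    "routine follow up visit with no complaints",
    "patient feels well and wound is healing",
    "no new symptoms noted at this visit",
    "follow up imaging scheduled, patient feels well",
    "wound healing as expected, no complaints",
    "routine visit, no symptoms reported",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Assumed hyperparameters, chosen only so the toy example runs.
classifiers = {
    "SVC": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}


def compare_classifiers(texts, y, n_splits=3):
    """Return the mean cross-validated ROC AUC for each classifier."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for name, clf in classifiers.items():
        # The vectorizer sits inside the pipeline, so it is refit on each
        # training fold and never sees the held-out notes.
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, texts, y, cv=cv, scoring="roc_auc")
        results[name] = scores.mean()
    return results


auc_by_model = compare_classifiers(notes, labels)
for name, auc in auc_by_model.items():
    print(f"TF-IDF + {name}: mean AUC = {auc:.2f}")
```

Word2Vec or Doc2Vec features could be substituted for the TF-IDF step in the same loop; doing so would need an embedding library such as gensim, which this sketch leaves out.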
RESULTS
TF-IDF and Doc2Vec each achieved their highest area under the receiver operating characteristic curve (AUC) when combined with Random Forest, followed by Support Vector Classification and K-Nearest Neighbors. Overall, Random Forest performed best among the classification algorithms.
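For readers less familiar with the metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive note receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the example labels and scores are hypothetical and are not results from this study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for four notes.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 3 of 4 pairs are ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the metric allows classifiers to be compared without fixing a decision threshold.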
CONCLUSIONS
This study developed a natural language processing pipeline for identifying treatment outcomes of invasive breast cancer in women from underrepresented populations, using the most effective text vectorization (TF-IDF, Doc2Vec) and classification (Random Forest) techniques among those compared. Further research is recommended to improve recall and to confirm the pipeline's adaptability in diverse, real-world healthcare settings.