Abstract
To improve the analysis of respondent comments from the Canadian Census of Population, data scientists at Statistics Canada compared and evaluated traditional machine learning, deep learning and transformer-based techniques. XLM-R (Cross-lingual Language Model-Robustly Optimized Bidirectional Encoder Representations from Transformers), a cross-lingual language model, was fine-tuned on census respondent comments and yielded the best result, an overall F1 score of 89.91%, despite language and class imbalances. Following the evaluation, the fine-tuned model was successfully implemented to objectively categorize comments from the 2021 Census of Population with high accuracy. As a result, feedback from respondents was directed to the appropriate subject matter analysts for post-collection analysis.
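Because the reported figure is an F1 score obtained under class imbalance, it may help to see how per-class F1 values combine into a single overall number. The sketch below is illustrative only: the abstract does not state which averaging scheme (macro, weighted or other) lies behind the 89.91% figure, and the class names and labels are hypothetical toy data, not census comments.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from paired true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced toy data: 8 "income" comments, 2 "housing" comments.
y_true = ["income"] * 8 + ["housing"] * 2
y_pred = ["income"] * 8 + ["income", "housing"]

if __name__ == "__main__":
    print(per_class_f1(y_true, y_pred))
    print(round(macro_f1(y_true, y_pred), 4))
    print(round(weighted_f1(y_true, y_pred), 4))
```

On this toy data the minority class pulls the macro average well below the weighted average, which is why the choice of averaging matters when classes are imbalanced.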
Subject
Statistics, Probability and Uncertainty, Economics and Econometrics, Management Information Systems