Estimator for generalization performance of machine learning model trained by biased data collected from multiple references

Published:2023-03-10 Issue:15 Volume:38 Page:2145-2162
ISSN:1093-9687
Container-title:Computer-Aided Civil and Infrastructure Engineering
language:en
Short-container-title:Computer aided Civil Eng

Author:

Okazaki Yuriko¹,Okazaki Shinichiro¹,Asamoto Shingo²,Yamaji Toru³,Ishige Minoru²

Affiliation:

1. Faculty of Engineering and Design Kagawa University Kagawa Japan

2. Graduate School of Science and Engineering Saitama University Saitama Japan

3. Structural Engineering Department Port and Airport Research Institute Yokosuka Kanagawa Japan

Abstract

The data acquired in civil engineering tasks often involve high acquisition costs, and the available datasets tend to have a limited number of samples and are highly biased. To estimate the performance of machine learning models, k‐fold cross‐validation (k‐CV) is widely used. However, if only limited data are available and the data distribution is biased, k‐CV tends to overestimate the performance for practical applications. This study proposed a new estimator, leave one reference out and k‐CV (LORO‐k‐CV), to determine the practical performance of machine learning models, that is, the generalization performance for population data in the target task, in case data are collected by multiple references resulting in biased data. LORO‐k‐CV is a combination of a new concept, LORO‐CV, that estimates the performance in the extrapolation region of the training data without human intervention and k‐CV, considering the ratio of the interpolation and extrapolation regions. The efficacy of LORO‐k‐CV was validated with its application to the regression task for the chloride‐ion concentration of concrete structures. To more specifically demonstrate the advantages of LORO‐k‐CV in model construction, the feature selections were conducted using both k‐CV and LORO‐k‐CV methods. These results revealed that LORO‐k‐CV can effectively construct a model with improved generalization performance even from the same data in cases where data are collected by multiple references, resulting in biased data.
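The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the interpolation/extrapolation ratio is computed, so the weight `w_extrap` below is a hypothetical parameter supplied by hand, a one-variable least-squares line stands in for the machine learning model, and all function names and the toy data are assumptions made for illustration. The sketch computes an ordinary k‐CV error (random folds), a LORO‐CV error (each fold holds out every sample from one reference), and a weighted combination of the two.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for any model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def rmse(model, xs, ys):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    a, b = model
    return (sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def cv_error(xs, ys, folds):
    """Mean test RMSE over a list of test-index folds."""
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(rmse(model, [xs[i] for i in test_idx],
                           [ys[i] for i in test_idx]))
    return mean(errors)


def k_folds(n, k, seed=0):
    """Random k-fold split: folds mix samples from all references."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def loro_folds(refs):
    """One fold per reference: all samples from that source are held out."""
    return [[i for i, r in enumerate(refs) if r == ref]
            for ref in sorted(set(refs))]


def loro_k_cv(xs, ys, refs, k=5, w_extrap=0.5, seed=0):
    """Weighted mix of k-CV and LORO-CV errors.

    w_extrap is a hypothetical, hand-set share of the extrapolation region;
    the paper's own rule for this ratio is not given in the abstract.
    """
    e_k = cv_error(xs, ys, k_folds(len(xs), k, seed))
    e_loro = cv_error(xs, ys, loro_folds(refs))
    return (1 - w_extrap) * e_k + w_extrap * e_loro, e_k, e_loro


# Toy data (assumed): three references, each covering a different range of x
# and carrying its own systematic offset.
xs = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0]
refs = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
offset = {"A": 0.0, "B": 3.0, "C": -2.0}
ys = [0.5 * x + offset[r] for x, r in zip(xs, refs)]

combined, e_k, e_loro = loro_k_cv(xs, ys, refs, k=4, w_extrap=0.5)
print(f"k-CV RMSE: {e_k:.3f}  LORO-CV RMSE: {e_loro:.3f}  combined: {combined:.3f}")
```

Setting `w_extrap=0` recovers plain k‐CV and `w_extrap=1` recovers plain LORO‐CV, so the parameter makes explicit how much weight the estimate gives to performance on unseen references.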

Publisher

Wiley

Subject

Computational Theory and Mathematics,Computer Graphics and Computer-Aided Design,Computer Science Applications,Civil and Structural Engineering,Building and Construction

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1111/mice.12992

Cited by 4 articles.

1. Performance‐driven contractor recommendation system using a weighted activity–contractor network;Computer-Aided Civil and Infrastructure Engineering;2024-08-29

2. A smoothness control method for kilometer‐span railway bridges with analysis of track characteristics;Computer-Aided Civil and Infrastructure Engineering;2024-04-30

3. An integration–competition network for bridge crack segmentation under complex scenes;Computer-Aided Civil and Infrastructure Engineering;2023-10-16

4. Machine Learning in Predicting Printable Biomaterial Formulations for Direct Ink Writing;Research;2023-01