Flexible Techniques to Detect Typical Hidden Errors in Large Longitudinal Datasets

Published: 2024-04-28
Volume: 16, Issue: 5, Page: 529
ISSN: 2073-8994
Journal: Symmetry
Language: en

Author:

Bruni Renato¹, Daraio Cinzia¹, Di Leo Simone¹

Affiliation:

1. Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy

Abstract

The increasing availability of longitudinal data (repeated numerical observations of the same units at different times) requires the development of flexible techniques to automatically detect errors in such data. Besides standard types of errors, which can be treated with generic error correction techniques, large longitudinal datasets may present specific problems not easily traceable by the generic techniques. In particular, after applying those generic techniques, time series in the data may contain trends, natural fluctuations and possible surviving errors. To study the data evolution, one main issue is distinguishing those elusive errors from the rest, which should be kept as they are and not flattened or altered. This work responds to this need by identifying some types of elusive errors and by proposing a statistical-mathematical approach to capture their complexity that can be applied after the above generic techniques. The proposed approach is based on a system of indicators and works at the formal level by studying the differences between consecutive values of data series and the symmetries and asymmetries of these differences. It operates regardless of the specific meaning of the data and is thus applicable in a variety of contexts. We implement this approach in a relevant database of European Higher Education institutions (ETER) by analyzing two key variables: “Total academic staff” and “Total number of enrolled students”, which are two of the most important variables, often used in empirical analysis as a proxy for size, and are considered by policymakers at the European level. The results are very promising.
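
The abstract describes the approach only at a high level: indicators built on the differences between consecutive values of a series and on the symmetry or asymmetry of those differences. The Python sketch below illustrates that general idea and is not the authors' indicator system. The function names, the thresholds `jump_ratio` and `symmetry_tol`, and the sample values are all assumptions made for illustration. It flags a value when the step into it and the step out of it are both large, have opposite signs, and nearly cancel (a symmetric spike or dip), while leaving trends and lasting level shifts untouched.

```python
from statistics import median


def consecutive_differences(series):
    """Return the differences between each value and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]


def flag_isolated_spikes(series, jump_ratio=0.5, symmetry_tol=0.25):
    """Return indices of values that look like isolated spikes or dips.

    Hypothetical rule (not taken from the paper): index i is flagged when
    the step into it and the step out of it are both larger than
    jump_ratio times the median magnitude of the series, point in
    opposite directions, and cancel each other to within symmetry_tol.
    """
    if len(series) < 3:
        return []  # a spike needs a neighbour on each side

    diffs = consecutive_differences(series)
    scale = median(abs(v) for v in series) or 1.0  # avoid a zero threshold

    flagged = []
    for i in range(1, len(series) - 1):
        d_in, d_out = diffs[i - 1], diffs[i]
        large = min(abs(d_in), abs(d_out)) > jump_ratio * scale
        opposite = d_in * d_out < 0
        symmetric = abs(d_in + d_out) <= symmetry_tol * max(abs(d_in), abs(d_out))
        if large and opposite and symmetric:
            flagged.append(i)
    return flagged


# Made-up yearly staff counts with one suspicious value (1070).
staff = [100, 104, 1070, 110, 113, 118]
print(flag_isolated_spikes(staff))

# A lasting level shift is asymmetric (one large step, no step back),
# so nothing is flagged and the series is kept as it is.
shifted = [100, 102, 200, 204, 207]
print(flag_isolated_spikes(shifted))
```

Scaling the threshold by the series median keeps the rule independent of the unit of measurement, in line with the abstract's statement that the approach operates regardless of the specific meaning of the data. A real application would need thresholds calibrated on the dataset at hand.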

Funder

Sapienza research grants

Publisher

MDPI AG

Link

https://www.mdpi.com/2073-8994/16/5/529/pdf
