Preenchimento de valores faltantes em séries temporais utilizando árvores de decisão

Alisson Silva Neimaier; Taiane Schaedler Prass

doi:10.5902/2179460X84257

Missing values imputation in time series using decision trees

Authors

Alisson Silva Neimaier Universidade Federal do Rio Grande do Sul https://orcid.org/0000-0002-7524-0776
Taiane Schaedler Prass Universidade Federal do Rio Grande do Sul https://orcid.org/0000-0003-3136-909X

DOI:

https://doi.org/10.5902/2179460X84257

Keywords:

ARMA, Random walk, Decision trees, Missing data, Imputation

Abstract

Imputing missing values in time series is a problem that has received little attention. Existing studies generally focus on linear models from the ARIMA family and do not examine whether the proposed methods remain valid when a large share of the data is missing, a setting in which parametric methods face the additional difficulty of identifying the model order. To address these issues, this study proposes a methodology for time series reconstruction using decision trees, a machine learning method that does not assume a parametric model for the data. In this approach, the observed values of the time series act as the response variable, and the corresponding lagged values are used as predictors. The tree selected by the training algorithm is then used to predict the missing values of the response. Monte Carlo simulations are used to investigate the methodology, considering processes from the ARMA family and the random walk while varying the length of the series, the model parameters, the proportion of missing values, and the set of predictors. To assess the quality of the reconstructions, the decision tree predictions are compared with those of several traditional imputation methods. The results demonstrate the potential of the proposed method and are consistent with the theoretical framework of the study. To promote the methodology, a Shiny application has been developed and made publicly available.
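
The idea of regressing observed values on their own lags can be illustrated with a minimal sketch. It is not the authors' implementation (their reference list points to R and the rpart package); it is a small pure-Python regression tree written only for illustration. The function names (`fit_tree`, `predict`, `impute_with_tree`), the choice of lags 1 and 2, the tree depth, and the handling of missing predictors are all assumptions: the tree is trained only on rows whose response and lags are all observed, and gaps are filled from left to right so that previously imputed values can serve as lags.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Greedy regression tree: a leaf is a float, a node is a tuple."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return sum(y) / len(y)               # leaf predicts the mean response
    best = None                               # (cost, feature index, threshold)
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, thr)
    if best is None:                          # no admissible split was found
        return sum(y) / len(y)
    _, j, thr = best
    li = [i for i, row in enumerate(X) if row[j] <= thr]
    ri = [i for i, row in enumerate(X) if row[j] > thr]
    return (j, thr,
            fit_tree([X[i] for i in li], [y[i] for i in li],
                     depth + 1, max_depth, min_leaf),
            fit_tree([X[i] for i in ri], [y[i] for i in ri],
                     depth + 1, max_depth, min_leaf))

def predict(tree, row):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        j, thr, left, right = tree
        tree = left if row[j] <= thr else right
    return tree

def impute_with_tree(series, lags=(1, 2), max_depth=3, min_leaf=5):
    """Fill None entries of `series` using a tree trained on lagged values."""
    p = max(lags)
    X, y = [], []
    for t in range(p, len(series)):
        row = [series[t - k] for k in lags]
        # Assumption: train only on rows with observed response and lags.
        if series[t] is not None and None not in row:
            X.append(row)
            y.append(series[t])
    if not y:
        raise ValueError("no complete rows available for training")
    tree = fit_tree(X, y, max_depth=max_depth, min_leaf=min_leaf)
    observed = [v for v in series if v is not None]
    fallback = sum(observed) / len(observed)  # used when a lag is unavailable
    filled = list(series)
    for t in range(len(filled)):
        if filled[t] is None:
            row = [filled[t - k] if t - k >= 0 else None for k in lags]
            filled[t] = fallback if None in row else predict(tree, row)
    return filled

# Illustrative data: an AR(1) path with about 20% of the values removed.
random.seed(1)
x = [0.0]
for _ in range(199):
    x.append(0.7 * x[-1] + random.gauss(0, 1))
series = [None if random.random() < 0.2 else v for v in x]
filled = impute_with_tree(series)
```

Filling from left to right is only one possible way to handle missing lags; a tree implementation that supports missing predictors directly would not need this step.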

Author Biographies

Alisson Silva Neimaier, Universidade Federal do Rio Grande do Sul

Master's degree in Statistics from Universidade Federal do Rio Grande do Sul (UFRGS), 2022-2024.

Taiane Schaedler Prass, Universidade Federal do Rio Grande do Sul

Postdoctoral fellowship in Mathematics at Universidade Federal do Rio Grande do Sul.

Published

2024-11-29

How to Cite

Neimaier, A. S., & Prass, T. S. (2024). Missing values imputation in time series using decision trees. Ciência e Natura, 46, e84257. https://doi.org/10.5902/2179460X84257

Issue

Vol. 46 (2024): Continuous publication

Section

Statistics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
