A proposal for identifying multivariate outliers

André Felipe Berdusco Menezes, Josmar Mazucheli, Kelly Vanessa Parede Barco

Abstract


The identification of outliers plays an important role in statistical analysis, since such observations may carry important information about the hypotheses of the study. If classical statistical models are applied blindly to data containing atypical values, the results may be misleading and lead to mistaken decisions. Moreover, in practical situations the outliers themselves are often the points of interest, and their identification may be the main objective of the investigation. This work therefore proposes a technique for detecting multivariate outliers based on cluster analysis and compares it with outlier identification via the Mahalanobis distance. Data were generated by Monte Carlo simulation from a mixture of multivariate normal distributions. In the simulations, the proposed method was superior to the Mahalanobis method in both sensitivity and specificity; that is, it showed greater ability to correctly classify both outliers and non-outliers. In addition, the proposed methodology is illustrated with an application to real data from the health field.
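As a rough illustration of the comparison baseline only, the following sketch flags outliers by squared Mahalanobis distance in a simulated two-component bivariate normal mixture and reports sensitivity and specificity. It is not the authors' proposed cluster-based method, which the abstract does not specify. The function names, the sample size, the 5% contamination rate, the mean shift of 6, the seed, and the 0.975 chi-square cutoff are all assumptions chosen for illustration.

```python
import math
import random


def simulate(n, contamination, shift, rng):
    """Draw n bivariate points from a two-component normal mixture.

    Each point is an outlier with probability `contamination`; outliers
    have both coordinates centred at `shift` instead of 0 (assumed setup).
    """
    points, labels = [], []
    for _ in range(n):
        is_outlier = rng.random() < contamination
        centre = shift if is_outlier else 0.0
        points.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        labels.append(is_outlier)
    return points, labels


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean,
    using the sample covariance matrix (divisor n - 1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Closed-form inverse of the 2x2 covariance matrix applied to (dx, dy).
    return [
        (syy * (x - mx) ** 2
         - 2.0 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in points
    ]


def sensitivity_specificity(flags, labels):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fn = sum((not f) and l for f, l in zip(flags, labels))
    tn = sum((not f) and (not l) for f, l in zip(flags, labels))
    fp = sum(f and (not l) for f, l in zip(flags, labels))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


rng = random.Random(2017)  # fixed seed so the run is reproducible
points, labels = simulate(n=500, contamination=0.05, shift=6.0, rng=rng)
d2 = mahalanobis_sq(points)

# With 2 degrees of freedom the chi-square quantile has a closed form:
# q-quantile = -2 * ln(1 - q).
cutoff = -2.0 * math.log(1.0 - 0.975)
flags = [d > cutoff for d in d2]

sens, spec = sensitivity_specificity(flags, labels)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because the mean and covariance here are estimated from the contaminated sample, the outliers inflate the covariance and can partly mask themselves. A Monte Carlo study would repeat this over many simulated samples and average the two rates.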

Keywords


Outlier; Cluster Analysis; Monte Carlo Method

DOI: http://dx.doi.org/10.5902/2179460X27500
