Educational inequalities in enem: a perspective based on socioeconomic variables and machine learning

Authors

DOI:

https://doi.org/10.5902/2318133893251

Keywords:

Enem microdata, random forest, machine learning

Abstract

The National High School Exam (Enem) serves as an important gateway to higher education in Brazil. This study examines the relationship between socioeconomic variables and student performance on the exam, employing machine learning techniques to identify significant patterns. The research has three main objectives: to develop random forest predictive models that classify student performance; to identify the most relevant socioeconomic variables; and to analyze their impact on results, with the aim of informing more equitable educational policies. The study used Enem 2023 microdata, which underwent preprocessing that included one-hot encoding for certain variables and SMOTE for class balancing. Ten random forest models were built, with hyperparameter tuning via random search. Performance was evaluated using accuracy, precision, recall, and F1-score, along with variable importance analysis. The models demonstrated satisfactory performance, with accuracy around 94% and precision up to 99%. Parental education level, parental occupation, and family income emerged as key predictors. Students with more educated parents in strategic professions were three times more likely to achieve high performance, while those from low-income families showed a greater tendency toward unsatisfactory results. The findings highlight the influence of socioeconomic factors on educational performance, underscoring the need for appropriate public policies. The models' effectiveness supports their utility for educational diagnostics.
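The four evaluation metrics named above measure different things, which is why accuracy and precision can diverge as they do in the reported results. The following is a minimal sketch of how they are derived from confusion-matrix counts for one positive class. The function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are hypothetical illustrations, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts relative to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true positives that the model recovers.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy labels (assumption): 1 = high performance, 0 = otherwise.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 while precision, recall, and F1 are each 0.75. On imbalanced data the gap between accuracy and the other metrics can be much larger, which is one reason to report all four.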

Author Biographies

Marcelo de Souza, Universidade do Estado de Santa Catarina

Professor (Professor Adjunto) in the Department of Software Engineering and the Graduate Program in Information Management at Universidade do Estado de Santa Catarina. He holds a master's degree and a doctorate in Computer Science from Universidade Federal do Rio Grande do Sul, and a bachelor's degree in Information Systems from Universidade do Estado de Santa Catarina, with an exchange period at the University of León (Spain). He was also a visiting researcher at the Alliance Manchester Business School, University of Manchester (United Kingdom). He works in artificial intelligence, combinatorial optimization, algorithms, and graphs.

Daniel Larion Klug, Universidade do Estado de Santa Catarina

Bachelor's degree in Software Engineering from Universidade do Estado de Santa Catarina.

References

BERGSTRA, James; BENGIO, Yoshua. Random search for hyper-parameter optimization. Journal of Machine Learning Research, Brookline, v. 13, n. 2, 2012, p. 281-305.

BREIMAN, Leo. Random forests. Machine Learning, Berlin, v. 45, 2001, p. 5-32. DOI: https://doi.org/10.1023/A:1010933404324

CHAWLA, Nitesh V; BOWYER, Kevin W; HALL, Lawrence O; KEGELMEYER, W. Philip. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, El Segundo, v. 16, 2002, p. 321-357. DOI: https://doi.org/10.1613/jair.953

MEC. Enem: Exame Nacional do Ensino Médio 2023. Available at: https://www.gov.br/inep/pt-br/areas-de-atuacao/avaliacao-e-exames-educacionais/enem. Accessed: 18 Sep. 2023.

SEGER, Christian. An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing. KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science: Stockholm, Sweden, 2018.

HE, Haibo; GARCIA, Edwardo. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, Los Alamitos, v. 21, n. 9, 2009, p. 1263-1284. DOI: https://doi.org/10.1109/TKDE.2008.239

KRAWCZYK, Bartosz. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, Heidelberg, v. 5, n. 4, 2016, p. 221-232. DOI: https://doi.org/10.1007/s13748-016-0094-0

FERNÁNDEZ, Alberto; GARCIA, Salvador; HERRERA, Francisco; CHAWLA, Nitesh. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, El Segundo, v. 61, 2018, p. 863-905. DOI: https://doi.org/10.1613/jair.1.11192

Published

2025-10-10

How to Cite

Souza, M. de, & Klug, D. L. (2025). Educational inequalities in enem: a perspective based on socioeconomic variables and machine learning. Regae: Revista De Gestão E Avaliação Educacional, e93251. https://doi.org/10.5902/2318133893251