Evaluation of Text Mining Methods Applied to the Detection of Brazilian Electoral Fake News
DOI:
https://doi.org/10.5902/2175497763139
Keywords:
Elections, Experimentation, Fake News
Abstract
The evolution of the media has contributed to the spread of false news, especially after the emergence of digital social networks. The speed at which this news spreads has made it impossible to check such a large volume of data manually. In this context, work has been carried out in several areas to try to minimize the damage caused by the proliferation of so-called fake news. The objective of this work is to evaluate the effectiveness of the most widely used text-matching methods for detecting false news, based on the 2018 Brazilian presidential elections, and to compare the results with those published in the literature for the 2016 US election. Additionally, an overview of the fake news spread by followers of each candidate is presented. A controlled experiment was planned and executed to compare the effectiveness of the selected methods. The TF-IDF and BM25 methods stood out in this context, with statistically similar mean values, respectively, of Accuracy (79.86% and 79.00%), Precision (79.97% and 78.76%), Sensitivity (78.97% and 76.05%) and F1-measure (79.47% and 77.38%). The effectiveness was similar to that reported for the North American context, in which BM25 achieved an Accuracy of 79.99%. Furthermore, considering the set of checked news available, the period analyzed and a margin of error of 3.5%, the results showed that fake news was spread by both sides: followers of the candidate Jair Bolsonaro (PSL) were responsible for 62.25% of the tweets related to fake news, against 37.75% for followers of the candidate Fernando Haddad (PT). Of the accounts deleted from the social network within a short time, 59.96% belonged to followers of the PSL candidate and 40.04% to followers of the PT candidate. Spreading fake news does not always imply intent; it may only reflect greater engagement by some followers.
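To make the comparison in the abstract more concrete, the following is a minimal sketch of how a tweet might be matched against a set of checked claims with TF-IDF cosine similarity and with BM25, and of how Accuracy, Precision, Sensitivity and F1 are computed from a confusion matrix. It is not the authors' implementation: the example texts, the whitespace tokenizer, the smoothed IDF variant, the BM25 parameters `k1=1.5` and `b=0.75`, and the confusion-matrix counts are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, no stemming or stop words.
    return text.lower().split()

def build_idf(docs):
    # Smoothed inverse document frequency over tokenized documents.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def tfidf(doc, idf):
    # Sparse TF-IDF vector; terms unseen in the corpus are ignored.
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bm25(query, doc, docs, k1=1.5, b=0.75):
    # Okapi BM25 score of one document for a tokenized query.
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

def metrics(tp, fp, tn, fn):
    # Accuracy, precision, sensitivity (recall) and F1 from confusion counts.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, f1

# Hypothetical checked claims and tweet, for illustration only.
claims = [tokenize(c) for c in [
    "voting machines were tampered with before the election",
    "candidate promised to abolish the minimum wage",
    "ballot boxes arrived already filled with votes",
]]
tweet = tokenize("they say the voting machines were tampered with")

idf = build_idf(claims)
tweet_vec = tfidf(tweet, idf)
tfidf_scores = [cosine(tweet_vec, tfidf(c, idf)) for c in claims]
bm25_scores = [bm25(tweet, c, claims) for c in claims]

best_tfidf = tfidf_scores.index(max(tfidf_scores))
best_bm25 = bm25_scores.index(max(bm25_scores))

# Hypothetical confusion-matrix counts, not taken from the study.
accuracy, precision, sensitivity, f1 = metrics(tp=8, fp=2, tn=7, fn=3)
```

In a setting like the one described, a tweet would presumably be labeled as related to a checked claim only when its best score exceeds some threshold; the threshold and the preprocessing steps are design choices that this sketch leaves open.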
License
Copyright (c) 2022 Animus. Revista Interamericana de Comunicação Midiática

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The authors of texts approved by the referees of Animus - Inter-American Journal of Media Communication automatically grant the journal, free of charge, the right of first publication of the submitted material.