Evaluation of Text Mining Methods Applied to the Detection of Brazilian Electoral Fake News

Caio Vinícius Meneses Silva; Raphael Silva Fontes; Methanias Colaço Júnior

doi:10.5902/2175497763139

Authors

Caio Vinícius Meneses Silva, Universidade Federal de Sergipe, https://orcid.org/0000-0002-3242-660X
Raphael Silva Fontes, Universidade Federal de Sergipe, https://orcid.org/0000-0003-3160-3384
Methanias Colaço Júnior, Universidade Federal de Sergipe, https://orcid.org/0000-0002-4811-1477

Keywords:

Elections, Experimentation, Fake News

Abstract

Context: The evolution of communication media has contributed to the spread of false news, especially since the emergence of digital social networks. The speed at which this news spreads has made manual checking of such an immense volume of data infeasible. Against this backdrop, work in several fields has sought to minimize the damage caused by the proliferation of so-called fake news. Objective: This work evaluates the effectiveness of the most widely used text-matching methods for detecting false news, based on the 2018 Brazilian presidential election, and compares the results with those published in the literature for the 2016 US election. Additionally, an overview of the fake news spread by each candidate's followers is presented. Method: A controlled experiment was planned and executed to compare the effectiveness of the selected methods. Results: The TF-IDF and BM25 methods stood out in this context, with statistically similar means, respectively, for Accuracy (79.86% and 79.00%), Precision (79.97% and 78.76%), Sensitivity (78.97% and 76.05%), and F1-Measure (79.47% and 77.38%). Conclusion: Effectiveness was similar to that in the US context, in which BM25 achieved an Accuracy of 79.99%. Furthermore, considering the available universe of fact-checked news, the period analyzed, and a margin of error of 3.5%, there was evidence that fake news was spread by both sides, and that followers of candidate Jair Bolsonaro (PSL) were responsible for 62.25% of the tweets related to false news, versus 37.75% for followers of candidate Fernando Haddad (PT). Regarding accounts deleted from the social network within a short period, 59.96% belonged to followers of the PSL candidate and 40.04% to followers of the PT candidate. Spreading fake news does not always imply intent; it may merely reflect greater engagement by some followers.
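
To illustrate the kind of text matching and evaluation the abstract refers to, the following is a minimal Python sketch and not the authors' implementation. It weights terms with TF-IDF, compares each tweet to a list of fact-checked false claims by cosine similarity, and computes Accuracy, Precision, Sensitivity, and F1-Measure from the resulting predictions. The example texts, the 0.3 threshold, the smoothed IDF formula, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline would also strip
    # punctuation, URLs, and stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant; the abstract does not specify one).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def metrics(y_true, y_pred):
    # Accuracy, Precision, Sensitivity (recall), and F1 from binary labels.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, precision, sensitivity, f1


# Invented example data: checked false claims, tweets, and manual labels.
claims = [
    "voting machines were rigged before the election",
    "candidate promised to close all public universities",
]
tweets = [
    "they say the voting machines were rigged",
    "lovely sunny weather in aracaju today",
]
labels = [True, False]  # True = tweet relates to a checked false claim

THRESHOLD = 0.3  # hypothetical cut-off; it would need tuning on real data

vectors = tfidf_vectors(claims + tweets)
claim_vecs, tweet_vecs = vectors[:len(claims)], vectors[len(claims):]

# A tweet is flagged when it is similar enough to at least one claim.
scores = [max(cosine(t, c) for c in claim_vecs) for t in tweet_vecs]
flagged = [s >= THRESHOLD for s in scores]

print(flagged)
print(metrics(labels, flagged))
```

BM25 could be substituted for the TF-IDF weighting in the same matching loop; only the scoring function would change, while the threshold and the metric computation stay the same.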
