Deception in computer-mediated communication represents a threat, and there is a growing need for efficient methods of detecting it. Machine learning models combined with natural language processing have proven highly successful at detecting lexical patterns related to deception. In this study, four selected machine learning models were trained and tested on data collected through a crowdsourcing platform on the topics of COVID-19 and climate change. The performance of the models was tested using n-gram analysis (from unigrams to trigrams) and psycho-linguistic analysis. Important features were selected, and the models were then additionally tested on different subsets of the obtained features. This study concludes that the subjectivity of the collected data greatly affects the detection of hidden linguistic features of deception. Psycho-linguistic analysis, alone and in combination with n-grams, achieves better classification results than n-gram analysis alone, both when the models are tested on their own data and when the possibility of generalization is examined. The difference is most pronounced on trigrams, where the combined approach improves accuracy by up to 16%. N-gram analysis proved to be the more robust method when the mutual applicability of the models was tested, while psycho-linguistic analysis remained the most inflexible.
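To make the n-gram analysis concrete, the following minimal Python sketch counts unigrams, bigrams, and trigrams in a single text. It is an illustration only and not the authors' pipeline: the function names `ngrams` and `ngram_counts`, the lowercase whitespace tokenization, and the example sentence are all assumptions introduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n_min=1, n_max=3):
    """Count every n-gram with n_min <= n <= n_max in `text`.

    Tokenization is a plain lowercase whitespace split (an assumption;
    a real study would likely use a proper tokenizer).
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical example statement, not taken from the study's data.
counts = ngram_counts("the virus is not real")
for gram, freq in sorted(counts.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" ".join(gram), freq)
```

Counts of this kind could serve as the feature vector passed to a classifier, optionally alongside psycho-linguistic features for a combined approach.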
Brzic, B., Boticki, I., & Bagic Babac, M. (2023). Detecting Deception Using Natural Language Processing and Machine Learning in Datasets on COVID-19 and Climate Change. Algorithms, 16(5). https://doi.org/10.3390/a16050221