Building Artificial Neural Networks for NLP Analysis and Classification of Target Content

Aleksey Rogachev; Elena Melikhova; Gennady Atamanov

Conference Proceedings, Open Access

Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020) (2021) 531

DOI: 10.2991/assehr.k.210225.058

Abstract

Natural language processing (NLP) of texts with artificial intelligence (AI) methods is difficult because of the semantic and lexical diversity of the texts. This diversity has given rise to a variety of machine learning (ML) metrics for neural network analysis. The analysis is further complicated by the fact that the content under study often contains "information garbage", that is, information noise, which complicates the well-known problem of text classification. The lexical diversity of Internet content calls for improved methods of neural network NLP analysis. The purpose of the research is to identify and solve the problems that arise when analyzing informational texts with artificial neural networks (ANN), using socio-political content as an example. Well-known NLP techniques include justifying the structure of, and building, a subject-oriented database of text corpora, constructing dictionaries based on frequency analysis, and numerically vectorizing texts. To identify latent semantic content, the use of a dense vector representation of terms in a multidimensional space (the embedding model) is justified. To account for sequences and combinations of the analyzed terms, the basic ANN architectures were built from modifications of convolutional (Conv1D) and recurrent (RNN, LSTM, etc.) layers, which can retain token sequences. Since such powerful layers make the ANN prone to overfitting, effective means of regularization, such as dropout layers, are necessary. The authors substantiate a modified NLP approach to identifying sociocultural and cyber threats contained in the information content of Internet resources. Dictionaries of terms for multi-class text analysis, together with their labels, are formed in advance from a frequency analysis of the target Internet content.
To justify and study architectures and hyperparameters suited to the analyzed subject area, a family of ANNs was built in Python using specialized libraries, including Keras, scikit-learn, and others. The ANN architectures combined fully connected, convolutional, and/or recurrent layers. The ANNs were trained on high-performance GPUs in the Google Colaboratory environment. Recommendations are given for selecting ANN hyperparameters that are invariant across the various hidden-layer architectures of hybrid ANNs for multiclass NLP analysis. The share of correctly recognized texts in the test sample exceeded 80%. Recommendations for improving it are given. This is an open access article distributed under the CC BY-NC 4.0 license: http://creativecommons.org/licenses/by-nc/4.0/.
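
The abstract does not include the authors' code. As an illustration only, the preprocessing steps it names (building a dictionary by frequency analysis and turning texts into numeric vectors) might be sketched in plain Python as follows. The helper names, the reserved indices for padding and out-of-vocabulary tokens, the toy corpus, and the sizes `max_words` and `max_len` are all assumptions of this sketch, not details taken from the paper.

```python
import re
from collections import Counter

# Reserved indices: an assumption of this sketch, not taken from the paper.
PAD, OOV = 0, 1


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately simplistic)."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(texts, max_words):
    """Map the max_words most frequent tokens to indices starting at 2."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 2 for i, (tok, _) in enumerate(ranked[:max_words])}


def vectorize(text, vocab, max_len):
    """Encode a text as a fixed-length list of token indices."""
    ids = [vocab.get(tok, OOV) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Toy corpus, invented for illustration.
corpus = [
    "election results spark protest",
    "protest against election fraud",
    "new phishing attack targets bank users",
]
vocab = build_vocabulary(corpus, max_words=5)
sample = vectorize("Protest over election phishing", vocab, max_len=6)
print(vocab)
print(sample)
```

Fixed-length integer sequences of this kind are the usual input to an embedding layer, which then maps each index to a dense vector before any convolutional or recurrent layers are applied.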

Cite

Rogachev, A., Melikhova, E., & Atamanov, G. (2021). Building Artificial Neural Networks for NLP Analysis and Classification of Target Content. In Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020) (Vol. 531). Atlantis Press. https://doi.org/10.2991/assehr.k.210225.058
