Automated big security text pruning and classification

Abstract

Many security-related big data problems, including document, traffic, and system log analysis, require analysis of unstructured text. Consider the task of analyzing company documents for secure storage. Some might be too sensitive to put on a public cloud and require private storage with the associated backup overhead, some may be safe on the cloud in encrypted form, and some may be sufficiently non-sensitive to be stored on the cloud in plain text, without encryption and decryption overhead. Being able to make such categorizations autonomously can significantly strengthen data security, organization, and storage efficiency. In this paper, we analyze several baseline machine-learning-based security risk assessment algorithms and develop techniques to improve upon them. In particular, we examine document sensitivity labeling, assigning each paragraph in a document one of three levels of security risk. For evaluation, we use real sensitive texts from documents leaked by the WikiLeaks organization. We improve upon the baseline models by using probabilistic topic modeling via Latent Dirichlet Allocation to identify samples from impure subtopics in the training set, prior to training a logistic regression classifier.
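The step of identifying samples from impure subtopics can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define impurity, so the sketch assumes that a topic's purity is the share of its paragraphs carrying that topic's majority label, and that each paragraph has already been assigned a dominant topic by a topic model such as Latent Dirichlet Allocation. The function names, the 0.8 threshold, the "low"/"medium"/"high" labels, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def topic_purity(topics, labels):
    """Map each topic id to the share of its samples that carry the
    topic's majority label (an assumed definition of purity)."""
    by_topic = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        by_topic[topic][label] += 1
    return {t: max(c.values()) / sum(c.values()) for t, c in by_topic.items()}


def impure_sample_indices(topics, labels, min_purity=0.8):
    """Return indices of samples whose dominant topic falls below the
    purity threshold (the threshold value is a hypothetical choice)."""
    purity = topic_purity(topics, labels)
    return [i for i, t in enumerate(topics) if purity[t] < min_purity]


# Toy data: one dominant topic id per paragraph, assumed to come from a
# previously fitted topic model, plus a hypothetical three-level risk label.
topics = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
labels = ["high", "high", "high", "high",
          "low", "high", "medium", "low",
          "low", "low"]

flagged = impure_sample_indices(topics, labels)
flagged_set = set(flagged)
kept = [i for i in range(len(topics)) if i not in flagged_set]

print("flagged:", flagged)
print("kept:", kept)
```

In this toy data, topic 1 mixes all three labels, so its four paragraphs are flagged, and the remaining paragraphs would form the training set for a downstream classifier such as logistic regression.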

Cite

Alzhrani, K., Rudd, E. M., Chow, C. E., & Boult, T. E. (2016). Automated big security text pruning and classification. In Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 (pp. 3629–3637). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2016.7841028
