Automated big security text pruning and classification

Khudran Alzhrani; Ethan M. Rudd; C. Edward Chow; Terrance E. Boult

Conference Proceedings

Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 (2016) 3629-3637

DOI: 10.1109/BigData.2016.7841028

Abstract

Many security-related big data problems, including document, traffic, and system log analysis, require analysis of unstructured text. Consider the task of analyzing company documents for secure storage. Some might be too sensitive to put on a public cloud and require private storage with associated backup overhead, some may be safe on the cloud in encrypted form, and some may be sufficiently non-sensitive to be stored on the cloud in plain text without encryption and decryption overhead. Being able to make such categorizations autonomously can significantly strengthen data security, organization, and storage efficiency. In this paper, we analyze several baseline machine-learning-based security risk assessment algorithms and develop techniques to improve upon them. In particular, we examine labeling document sensitivity, labeling each paragraph in the document with one of three levels of security risk. For evaluation, we use real sensitive texts from documents leaked by the WikiLeaks organization. We improve upon the base models using probabilistic topic modeling via Latent Dirichlet Allocation to identify samples from impure subtopics in the training set, prior to training a logistic regression classifier.
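The pruning idea in the abstract's last sentence can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each training paragraph has already been assigned a dominant topic by a topic model such as LDA, and it assumes a simple definition of impurity (the share of the most common label within a topic) with an arbitrary threshold. The function name `prune_impure_topics`, the `min_purity` parameter, the label names, and the toy samples are all hypothetical.

```python
from collections import Counter, defaultdict


def prune_impure_topics(samples, min_purity=0.8):
    """Drop training samples that belong to label-impure topics.

    samples: list of (text, label, topic_id) tuples, where topic_id is
    assumed to come from a previously fitted topic model (e.g. LDA).
    Purity of a topic is assumed here to be the fraction of its samples
    that carry the topic's most common label.
    """
    labels_by_topic = defaultdict(Counter)
    for _text, label, topic_id in samples:
        labels_by_topic[topic_id][label] += 1

    purity = {
        topic_id: max(counts.values()) / sum(counts.values())
        for topic_id, counts in labels_by_topic.items()
    }
    # Keep only samples whose topic meets the purity threshold.
    kept = [s for s in samples if purity[s[2]] >= min_purity]
    return kept, purity


# Made-up toy data: three hypothetical sensitivity labels, three topics.
samples = [
    ("troop movements near the border", "secret", 0),
    ("identity of the source must be protected", "secret", 0),
    ("cable on ongoing negotiations", "secret", 0),
    ("press briefing schedule", "unclassified", 1),
    ("public holiday closure notice", "unclassified", 1),
    ("meeting logistics", "confidential", 2),
    ("meeting logistics follow-up", "unclassified", 2),
    ("meeting agenda", "secret", 2),
]

kept, purity = prune_impure_topics(samples)
```

In this toy run, topic 2 mixes all three labels and falls below the threshold, so its paragraphs are removed; the remaining samples would then be passed on to the classifier training step, which is omitted here.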

Cite

Alzhrani, K., Rudd, E. M., Chow, C. E., & Boult, T. E. (2016). Automated big security text pruning and classification. In Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 (pp. 3629–3637). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2016.7841028
