Detecting sensitive information from textual documents: An information-theoretic approach

David Sánchez; Montserrat Batet; Alexandre Viejo

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7647 LNAI 173-184

DOI: 10.1007/978-3-642-34620-0_17

Abstract

Whenever a document containing sensitive information needs to be made public, privacy-preserving measures should be implemented. Document sanitization aims at detecting sensitive pieces of information in text, which are removed or hidden prior to publication. Even though methods have been developed for detecting sensitive structured information such as e-mails, dates or social security numbers, or domain-specific data such as disease names, the sanitization of raw textual data has scarcely been addressed. In this paper, we present a general-purpose method to automatically detect sensitive information in textual documents in a domain-independent way. Relying on Information Theory and a corpus as large as the Web, it assesses the degree of sensitiveness of terms according to the amount of information they provide. Preliminary results show that our method significantly improves detection recall in comparison with approaches based on trained classifiers. © 2012 Springer-Verlag.
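The idea of scoring terms by the amount of information they provide can be illustrated with a minimal sketch. It assumes the standard information-content measure IC(t) = -log2 p(t), with p(t) estimated from corpus page counts; the abstract does not give the exact formula, the probability estimator, or how the detection threshold is chosen, so all of those are assumptions here. The hit counts, the corpus size, the smoothing and the names `information_content` and `detect_sensitive` are hypothetical and only serve the illustration.

```python
import math

# Hypothetical page counts standing in for Web search hit counts.
# These numbers are invented for illustration, not measured.
TOY_HIT_COUNTS = {
    "patient": 500_000_000,
    "hospital": 800_000_000,
    "hepatitis": 40_000_000,
    "Barcelona": 300_000_000,
}
TOTAL_PAGES = 10_000_000_000  # assumed size of the corpus


def information_content(term, hit_counts, total):
    """Return IC(t) = -log2 p(t), with p(t) estimated as hits(t) / total.

    Add-one smoothing (an assumption of this sketch) keeps the value
    finite for terms that never appear in the corpus.
    """
    count = hit_counts.get(term, 0)
    probability = (count + 1) / (total + 1)
    return -math.log2(probability)


def detect_sensitive(terms, hit_counts, total, threshold):
    """Return the terms whose information content reaches the threshold.

    Rare terms carry more information, so they are flagged first.
    The threshold is a free parameter of this sketch.
    """
    return [
        term
        for term in terms
        if information_content(term, hit_counts, total) >= threshold
    ]


if __name__ == "__main__":
    terms = ["patient", "hospital", "hepatitis", "Barcelona"]
    for term in terms:
        ic = information_content(term, TOY_HIT_COUNTS, TOTAL_PAGES)
        print(f"{term}: {ic:.2f} bits")
    print(detect_sensitive(terms, TOY_HIT_COUNTS, TOTAL_PAGES, threshold=6.0))
```

In this toy setting the rarest term receives the highest score, which is the behaviour the abstract describes: the more specific a term, the more information it discloses. Unlike a trained classifier, such a score needs no labelled examples, only occurrence counts.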

Cite (APA)

Sánchez, D., Batet, M., & Viejo, A. (2012). Detecting sensitive information from textual documents: An information-theoretic approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7647 LNAI, pp. 173–184). https://doi.org/10.1007/978-3-642-34620-0_17
