Detecting sensitive information from textual documents: An information-theoretic approach

Abstract

Whenever a document containing sensitive information needs to be made public, privacy-preserving measures should be implemented. Document sanitization aims at detecting sensitive pieces of information in text, which are removed or hidden prior to publication. Even though methods have been developed to detect structured sensitive information, such as e-mail addresses, dates, or social security numbers, or domain-specific data such as disease names, the sanitization of raw textual data has been scarcely addressed. In this paper, we present a general-purpose method to automatically detect sensitive information in textual documents in a domain-independent way. Relying on information theory and a corpus as large as the Web, it assesses the degree of sensitivity of terms according to the amount of information they provide. Preliminary results show that our method significantly improves detection recall in comparison with approaches based on trained classifiers. © 2012 Springer-Verlag.
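The core idea described in the abstract can be sketched with the standard information-content measure, IC(t) = -log2 p(t): the rarer a term is in a large reference corpus, the more information it carries, and the more likely it is to be sensitive. The sketch below is illustrative only; the term counts, threshold, and helper names are assumptions, not the authors' actual implementation or data.

```python
import math

def information_content(term_hits: int, total_docs: int) -> float:
    """IC(t) = -log2 p(t), with p(t) estimated from document counts
    in a large corpus (e.g. Web page hit counts)."""
    if term_hits <= 0:
        raise ValueError("term must appear at least once in the corpus")
    return -math.log2(term_hits / total_docs)

def detect_sensitive(term_counts: dict, total_docs: int,
                     threshold: float) -> list:
    """Flag terms whose information content exceeds a threshold:
    rarer terms carry more information and are treated as more
    likely to be sensitive."""
    return [term for term, hits in term_counts.items()
            if information_content(hits, total_docs) > threshold]

# Hypothetical corpus statistics, for illustration only.
counts = {"the": 9_000_000, "hospital": 120_000, "cirrhosis": 900}
flagged = detect_sensitive(counts, total_docs=10_000_000, threshold=10.0)
# A very rare (high-IC) term like "cirrhosis" is flagged; common
# terms fall below the threshold.
```

The threshold would in practice be calibrated against the corpus and the desired recall/precision trade-off; here it is an arbitrary illustrative value.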

Citation (APA)

Sánchez, D., Batet, M., & Viejo, A. (2012). Detecting sensitive information from textual documents: An information-theoretic approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7647 LNAI, pp. 173–184). https://doi.org/10.1007/978-3-642-34620-0_17
