Web query logs are a rich source of information, but publishing them poses serious privacy risks. We preserve privacy when publishing vocabularies extracted from a web query log by introducing vocabulary k-anonymity, which prevents re-identification attacks that reveal the real identities behind vocabularies. A vocabulary is a bag of query terms extracted from the queries issued by a user at a specified granularity. Such bag-valued data are extremely sparse, which makes it hard to retain enough utility while enforcing k-anonymity. To the best of our knowledge, prior work does not solve this problem: some approaches target a different privacy principle (for example, differential privacy), some handle a different type of data (for example, set-valued or relational data), and some consider a different publication scenario (for example, publishing frequent keywords). To retain enough data utility, we propose a semantic similarity-based clustering approach. It measures the semantic similarity between a pair of terms by the minimum path distance over a semantic network of terms such as WordNet, computes the semantic similarity between two vocabularies by weighted bipartite matching, and publishes a typical vocabulary for each cluster of semantically similar vocabularies. Extensive experiments on the AOL query log show that our approach retains enough data utility, both in terms of loss metrics and in frequent pattern mining.
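As an illustration only, the two similarity steps named in the abstract can be sketched in Python. This is not the paper's implementation. The toy network `EDGES` stands in for WordNet, the mapping from path distance to similarity (`1 / (1 + d)`), the brute-force search used in place of a proper assignment algorithm, and the normalization by the larger bag size are all assumptions made for this sketch. All function names are hypothetical.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy semantic network; a stand-in for WordNet.
EDGES = [
    ("entity", "animal"), ("entity", "vehicle"),
    ("animal", "dog"), ("animal", "cat"), ("dog", "puppy"),
    ("vehicle", "car"), ("vehicle", "truck"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of term pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_distance(graph, src, dst):
    """Minimum number of edges between two terms (BFS); None if unreachable."""
    if src == dst:
        return 0
    if src not in graph or dst not in graph:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def term_similarity(graph, a, b):
    """Assumed mapping: shorter path means higher similarity, in (0, 1]."""
    dist = path_distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def vocabulary_similarity(graph, bag_a, bag_b):
    """Similarity of two bags (term -> count) via a best one-to-one matching.

    Brute force over permutations, so only suitable for tiny bags; a real
    system would use an assignment algorithm instead.
    """
    a = [t for t, c in sorted(bag_a.items()) for _ in range(c)]
    b = [t for t, c in sorted(bag_b.items()) for _ in range(c)]
    if len(a) > len(b):
        a, b = b, a
    if not b:
        return 1.0  # two empty bags are treated as identical
    best = max(
        sum(term_similarity(graph, x, y) for x, y in zip(a, perm))
        for perm in permutations(b, len(a))
    )
    # Assumed normalization: unmatched terms in the larger bag lower the score.
    return best / len(b)


GRAPH = build_graph(EDGES)
```

For example, `vocabulary_similarity(GRAPH, {"dog": 1, "car": 1}, {"puppy": 1, "truck": 1})` pairs `dog` with `puppy` and `car` with `truck`, since that matching has the highest total similarity. Vocabularies scoring high under such a measure would be candidates for the same cluster.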