Enhancing privacy and preserving accuracy of a distributed collaborative filtering
- ISBN: 9781595937308
- DOI: 10.1145/1297231.1297234
Abstract
Collaborative Filtering (CF) is a powerful technique for generating personalized predictions. CF systems are typically based on a central storage of user profiles used for generating the recommendations. However, such centralized storage introduces a severe privacy risk, since the profiles may be accessed for purposes, possibly malicious, not related to the recommendation process. Recent research proposed protecting the privacy of CF by distributing the profiles among multiple repositories and exchanging only the subset of the profile data that is useful for the recommendation. This work investigates how a decentralized distributed storage of user profiles combined with data modification techniques may mitigate some privacy issues. Results of experimental evaluation show that parts of the user profiles can be modified without hampering the accuracy of CF predictions. The experiments also indicate which parts of the user profiles are most useful for generating accurate CF predictions, while their exposure still keeps the essential privacy of the users.
Shlomo Berkovsky
University of Haifa, Israel
slavax@cs.haifa.ac.il
Yaniv Eytani
University of Illinois at
Urbana-Champaign, USA
yeytani2@uiuc.edu
Tsvi Kuflik
University of Haifa, Israel
tsvikak@is.haifa.ac.il
Francesco Ricci
Free University of
Bozen-Bolzano, Italy
fricci@unibz.it
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and
Software – distributed systems, user profile and alert services.
General Terms
Algorithms, Measurement, Performance, Experimentation
Keywords
Collaborative Filtering, Recommender Systems, Privacy.
1. INTRODUCTION
Collaborative Filtering (CF) [5] is one of the most popular and
widely-used personalization techniques. It generates personalized
recommendations, e.g., predictions of how a user may like an
item, based on the assumption that users who agreed in the past,
i.e., users whose opinions correlated in the past, will also agree in
the future [13]. The input to the CF algorithm is a ratings matrix
containing user profiles represented by ratings vectors, i.e., lists
of a user's ratings on a set of items. To generate a user's prediction
for an item, CF initially computes the degree of similarity
between the active user, i.e., the user whose preferences are being
predicted, and all the other users. Then, CF creates a
neighborhood of K users having the highest degree of similarity
with the active user and generates a prediction for a specific item
by computing a weighted average of the ratings of the other users
in the neighborhood on this item.
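The prediction procedure described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
implementation used in this work: the choice of Pearson correlation
as the similarity measure, the restriction to neighbors with positive
similarity who rated the target item, and the names pearson and
predict are all introduced here for the example only.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the items rated by both users (assumed metric)."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * sqrt(
        sum((b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(active, others, item, k=2):
    """Predict the active user's rating on `item` from the K most similar users."""
    # Similarity between the active user and every user who rated the item.
    scored = [(pearson(active, p), p[item]) for p in others if item in p]
    # Neighborhood: the K users with the highest (positive) similarity.
    neighbors = sorted(
        (s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True
    )[:k]
    if not neighbors:
        return None
    # Similarity-weighted average of the neighbors' ratings on the item.
    return sum(sim * r for sim, r in neighbors) / sum(sim for sim, _ in neighbors)
```

Profiles are plain dictionaries mapping item identifiers to ratings,
so a call such as predict(active_profile, other_profiles, "d", k=2)
returns the weighted average for item "d", or None when no suitable
neighbor exists.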
However, personalization inherently brings with it the issue of
privacy. Dealing with user profiles means that personal and
possibly sensitive information about users is collected, stored and
used by the recommender system. A system may violate users'
privacy by misusing (e.g., selling or exposing) users' private
information for its own benefit. As a result, users who are
aware of and concerned about such misuse refrain from using such
systems to prevent potential exposure of sensitive private
information [4].
Privacy hazards for recommender systems are aggravated by the
fact that it is commonly believed that accurate recommendations
require large amounts of personal data [11]. Thus, the more complete
and accurate the user profiles are, i.e., the higher the number of
ratings in the profile, the more reliable the recommendations are.
Hence, there is a trade-off between the users' privacy and the
accuracy of the recommendations provided to the users.
In this context, the need to protect users' privacy has triggered
growing research efforts. In [3] the authors proposed basing
privacy preservation on pure decentralized Peer-to-Peer (P2P)
communication between the users [1]. It was suggested to form
communities of users, where the overall community represents the
set of users as a whole and not as individual users. Alternatively,
[10] suggested preserving users' privacy on a central server by
adding uncertainty to the data by applying randomized data
obfuscation techniques that modify the user profiles. Hence, even
if the data are exposed to untrusted parties, those parties will not
have reliable knowledge about the true ratings in the profiles. The
current work expands and validates the idea of combining these two
approaches, as initially discussed in [2]. It deals with enhancing
the privacy of CF through (1) replacing the commonly used
centralized CF system with a virtual P2P one, while (2) adding a
degree of uncertainty to the data by modifying parts of the user
profiles.
Individual users participate in the virtual P2P-based CF system in
the following way. The users maintain their own profiles in the
form of ratings on items. Active users initiate prediction requests
by exposing parts of their profiles and sending them as part of the
prediction request. Other users, who choose to respond to the
request, expose their ratings on the requested items and send them
to the active user, jointly with the degree of similarity between
them. Note that the degree of similarity between the users is
computed based on the ratings stored by the responding users and
the part of the active user's profile
received with the prediction request. The active users collect the
users as the set of nearest neighbors and aggregate the ratings of
the neighbors for the prediction generation.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
RecSys'07, October 19–20, 2007, Minneapolis, Minnesota, USA.
Copyright 2007 ACM 978-1-59593-730-8/07/0010...$5.00.
In this setting, the users are in full control of their personal
sensitive information and they can autonomously decide when
and how to expose their profiles. In particular, the users may
decide that part of their profiles should be obfuscated, i.e., some
noise can be added, before revealing them. As a result, the
proposed approach on the one hand enhances users' privacy, while
on the other hand it still allows them to support prediction
generation initiated by other users and to participate in the CF
process.
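The request and response flow outlined above, including the optional
obfuscation step, might look roughly like the sketch below. It is an
illustration only and makes several assumptions that are not taken
from this work: the Peer class, the agreement similarity function
(a simple stand-in for whatever measure is actually used), the
obfuscate policy (replacing a fraction of the ratings with random
values on a 1 to 5 scale), and all parameter names are hypothetical.

```python
import random


def agreement(a, b):
    """Stand-in similarity: 1 / (1 + mean absolute rating difference)."""
    common = [i for i in a if i in b]
    if not common:
        return 0.0
    mad = sum(abs(a[i] - b[i]) for i in common) / len(common)
    return 1.0 / (1.0 + mad)


def obfuscate(profile, fraction, rng, scale=(1, 5)):
    """Replace a fraction of the ratings with random values (assumed policy)."""
    noisy = dict(profile)
    n = round(fraction * len(noisy))
    for item in rng.sample(sorted(noisy), n):
        noisy[item] = rng.randint(*scale)
    return noisy


class Peer:
    def __init__(self, ratings, noise=0.0, seed=0):
        self.ratings = dict(ratings)  # the true profile stays with its owner
        self.noise = noise            # fraction of ratings obfuscated on exposure
        self.rng = random.Random(seed)

    def request(self, item, exposed_items):
        # The active user exposes only a chosen part of the profile.
        part = {i: self.ratings[i] for i in exposed_items if i in self.ratings}
        return {"item": item, "profile": obfuscate(part, self.noise, self.rng)}

    def respond(self, req):
        # A responder returns its rating on the item and its similarity
        # to the exposed part of the active user's profile.
        if req["item"] not in self.ratings:
            return None
        view = obfuscate(self.ratings, self.noise, self.rng)
        return {
            "rating": view[req["item"]],
            "similarity": agreement(view, req["profile"]),
        }


def predict(active, peers, item, exposed_items, k=2):
    req = active.request(item, exposed_items)
    answers = [a for a in (p.respond(req) for p in peers) if a is not None]
    # The active user keeps the K most similar responders as neighbors.
    neighbors = sorted(answers, key=lambda a: a["similarity"], reverse=True)[:k]
    total = sum(a["similarity"] for a in neighbors)
    if total == 0:
        return None
    return sum(a["similarity"] * a["rating"] for a in neighbors) / total
```

In this sketch no peer ever sees another peer's stored profile: the
active user sees only ratings on the requested item and similarity
values, and responders see only the exposed, possibly obfuscated,
part of the active user's profile.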
In the experimental part of the paper, the accuracy of the
proposed privacy-enhanced CF is evaluated using the publicly
available MovieLens CF dataset [5]. Initial experimental results
demonstrate that there is a linear relationship between the amount
of obfuscation applied to the personal ratings in the profiles and
the decrease in accuracy of the recommendation prediction. These
results raised a question regarding the importance of certain
ratings for the accuracy of CF recommendations, i.e., about the
relationship between the quality of the available data and the
accuracy of the generated recommendations. Although CF is a
well-studied technique, no prior work has tried to understand which
kinds of ratings are important for the accuracy of the generated
predictions. This is extremely important in the context of privacy,
as users may have different concerns about the potential exposure
of their data, and therefore the quantity of the user's personal data
exposed to other users must be adapted to the kind of ratings that
are exposed.
For this, additional experiments aimed at analyzing the impact of
data obfuscation on different types of ratings (moderate ratings
with average values and extreme ratings with highly positive or
highly negative values) have been conducted. The results of the
experiments indicate that the accuracy of CF predictions is
affected more strongly by extreme ratings than by moderate ratings.
Hence, the conclusion is that the extreme ratings are the parts of
user profiles most valuable for generating accurate predictions,
and for this reason they should be made available to other users.
Conversely,
very little knowledge about the users may be derived from their
moderate ratings and, therefore, there is no need to expose these
parts of the profiles.
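One way to picture the distinction between extreme and moderate
ratings is the small sketch below. The function name split_profile,
the 1 to 5 rating scale with midpoint 3, and the threshold of 2 (so
that only ratings of 1 and 5 count as extreme) are assumptions made
for illustration; the exact definition used in the experiments is not
given in this part of the paper.

```python
def split_profile(ratings, midpoint=3.0, threshold=2.0):
    """Partition a profile into extreme and moderate ratings.

    A rating is treated as extreme when it lies at least `threshold`
    away from the scale midpoint (assumed definition).
    """
    extreme = {i: r for i, r in ratings.items() if abs(r - midpoint) >= threshold}
    moderate = {i: r for i, r in ratings.items() if i not in extreme}
    return extreme, moderate


profile = {"a": 5, "b": 3, "c": 1, "d": 4, "e": 2}
extreme, moderate = split_profile(profile)
# Under the conclusion above, a user would expose `extreme`
# and could withhold or obfuscate `moderate`.
```

A user following the conclusion above would send only the extreme
part with a prediction request and keep the moderate part private.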
This work also presents the results of an exploratory survey
examining the users' attitude towards the privacy-preserving CF
techniques illustrated in this paper. We aimed at understanding if
the benefits of the proposed privacy-preserving techniques
actually correlate with the users' attitude towards the techniques,
i.e., if the user is convinced that the proposed techniques preserve
her privacy. The results of the survey confirm that obfuscation
methods having the smallest effect on the accuracy of the
predictions are also preferred by the user. But they also show that
the extreme ratings, which are more important for the predictions
generation than the moderate ratings, are also considered by the
users as more sensitive. This shows that there is no simple way to
better preserve users' privacy without decreasing the accuracy of
the predictions, and that it is difficult to optimize both the
accuracy of the predictions and the users' sense of privacy.
The rest of the paper is organized as follows. Section 2 discusses
the privacy issues in CF and works on distributed CF. Section 3
presents the privacy-enhanced decentralized CF using user
profile obfuscation. Section 4 presents the experimental results
evaluating the proposed obfuscation approach. Section 5 presents
the users' survey and analyzes its results, and Section 6 concludes
the paper and presents directions for future research.
2. RELATED WORKS
Centralized CF poses a severe threat to users' privacy, as personal
information collected by the systems can be potentially
transferred to untrusted parties. Thus, most users are unwilling to
divulge their private information, and these concerns cause some
users to forgo the benefits of recommender systems due to
the privacy risks [4]. Hence, applying CF without compromising
the user's privacy is one of the important and challenging issues in
CF research.
This issue was tackled in prior research from several perspectives.
In [10], the authors proposed a method to preserve users' privacy
in a centralized CF server by adding uncertainty to the data. Before
transferring her profile to the server, each user obfuscated it using
randomized data modification techniques. Hence, the server
can learn only the modified, not the exact, contents of the
profile. Although this method changed the users' original data,
experiments showed that the obfuscated data still allows
generating accurate CF predictions. This approach improved
users' privacy, but the users still depended on a centralized server
storing the user profiles. This constituted a single point of failure,
as the data could still be exposed by an attacker through a series
of prediction requests for various items managed by the server.
Distributing user profiles across several locations reduces
the potential privacy breach of having all the data exposed to an
attacker, as the attacker must violate the security policies of all
the locations, rather than of only the centralized one. Conducting CF
over a distributed setting was initially proposed in [14]. This work
presented a P2P architecture supporting recommendations for
mobile customers represented by software agents. The agents'
communication exploited an expensive routing mechanism,
increasing the communication overheads. Another technique for a
distributed CF eliminating the use of central servers was proposed
in [8]. There, the active users create queries by sending parts of
their profiles and requesting predictions for specific items. Other
users autonomously decide if they are willing to respond to the
queries and send their information to the active users. However,
no obfuscation was applied to the data, so the original
user profiles were transferred between the users. Also, this
approach was neither implemented nor evaluated.
A basic scheme for a decentralized privacy-preserving CF was
proposed in [3]. According to it, individual users control their
private data, while they are grouped into a community of users,
representing public aggregation of their profiles. This aggregation
allows personalized predictions to be computed for the members
of the community or for outsiders by exposing the aggregated
community data, but without exposing the data of individual
users. In addition, the communication between the communities is
implemented using data encryption methods. Although this
approach protects overall users' privacy by abolishing a single
point of failure, it brings to the fore the issue of preserving the
privacy of individual users, since their ratings are easier to expose
than in the centralized setting. Also, the proposed approach requires
a priori formation of user communities, which may become a severe
limitation in today's dynamic environments.