
A Survey of Collaborative Filtering Techniques

by Xiaoyuan Su, Taghi M Khoshgoftaar
Advances in Artificial Intelligence (2009)

Abstract

As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, and privacy protection, together with their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (which combine CF with other recommendation techniques), with examples of representative algorithms in each category and an analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state of the art, we attempt to present a comprehensive survey of CF techniques, which can serve as a roadmap for research and practice in this area.


Available from www.hindawi.com

Hindawi Publishing Corporation
Advances in Artificial Intelligence, Volume 2009, Article ID 421425, 19 pages
doi:10.1155/2009/421425

Review Article

Xiaoyuan Su and Taghi M. Khoshgoftaar
Department of Computer Science and Engineering, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Correspondence should be addressed to Xiaoyuan Su, suxiaoyuan@gmail.com

Received 9 February 2009; Accepted 3 August 2009

Recommended by Jun Hong

Copyright © 2009 X. Su and T. M. Khoshgoftaar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

In everyday life, people rely on recommendations from other people by spoken words, reference letters, news reports from news media, general surveys, travel guides, and so forth.
Recommender systems assist and augment this natural social process to help people sift through available books, articles, webpages, movies, music, restaurants, jokes, grocery products, and so forth to find the most interesting and valuable information for them.

The developers of one of the first recommender systems, Tapestry [1] (other earlier recommendation systems include rule-based recommenders and user-customization), coined the phrase "collaborative filtering (CF)," which has been widely adopted despite the facts that recommenders may not explicitly collaborate with recipients and that recommendations may suggest particularly interesting items, in addition to indicating those that should be filtered out [2].

The fundamental assumption of CF is that if users X and Y rate n items similarly, or have similar behaviors (e.g., buying, watching, listening), then they will rate or act on other items similarly [3]. CF techniques use a database of preferences for items by users to predict additional topics or products a new user might like. In a typical CF scenario, there is a list of m users {u1, u2, . . . , um} and a list of n items {i1, i2, . . . , in}, and each user, ui, has a list of items, Iui, which the user has rated, or about which their preferences have been inferred through their behaviors. The ratings can be either explicit indications, such as scores on a 1–5 scale, or implicit indications, such as purchases or click-throughs [4]. For example, we can convert the list of people and the movies they like or dislike (Table 1(a)) to a user-item ratings matrix (Table 1(b)), in which Tony is the active user for whom we want to make recommendations. There are missing values in the matrix where users did not give their preferences for certain items.

There are many challenges for collaborative filtering tasks (Section 2).
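The conversion from a list of likes and dislikes to a user-item ratings matrix, as described above, can be sketched roughly as follows. This is only an illustration: the numeric encoding (+1 for like, -1 for dislike, None for a missing value) and the function name build_user_item_matrix are assumptions made here, not part of the survey.

```python
# Hypothetical encoding (an assumption): +1 = like, -1 = dislike,
# None = the user gave no preference for that item.
LIKE, DISLIKE = 1, -1

# Preferences as listed in Table 1(a).
preferences = {
    "Alice": {"Shrek": LIKE, "Snow White": LIKE, "Superman": DISLIKE},
    "Bob": {"Snow White": LIKE, "Superman": LIKE, "Spiderman": DISLIKE},
    "Chris": {"Spiderman": LIKE, "Snow White": DISLIKE},
    "Tony": {"Shrek": LIKE, "Spiderman": DISLIKE},
}


def build_user_item_matrix(prefs):
    """Return (users, items, matrix); matrix[u][i] is a rating or None."""
    users = list(prefs)
    # Collect every item mentioned by any user, in a stable (sorted) order.
    items = sorted({item for rated in prefs.values() for item in rated})
    matrix = [[prefs[user].get(item) for item in items] for user in users]
    return users, items, matrix


def sparsity(matrix):
    """Fraction of cells in the matrix that hold no rating."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


users, items, matrix = build_user_item_matrix(preferences)
```

Even in this toy matrix, 6 of the 16 cells are missing; real user-item matrices are typically far sparser.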
CF algorithms are required to have the ability to deal with highly sparse data, to scale with the increasing numbers of users and items, to make satisfactory recommendations in a short time period, and to deal with other problems like synonymy (the tendency of the same or similar items to have different names), shilling attacks, data noise, and privacy protection problems.

Table 1: An example of a user-item matrix.

(a) Likes and dislikes
Alice: (like) Shrek, Snow White, (dislike) Superman
Bob: (like) Snow White, Superman, (dislike) Spiderman
Chris: (like) Spiderman, (dislike) Snow White
Tony: (like) Shrek, (dislike) Spiderman

(b) User-item ratings matrix ("-" marks a missing value)
         Shrek     Snow White   Spiderman   Superman
Alice    Like      Like         -           Dislike
Bob      -         Like         Dislike     Like
Chris    -         Dislike      Like        -
Tony     Like      -            Dislike     ?

Early generation collaborative filtering systems, such as GroupLens [5], use the user rating data to calculate the similarity or weight between users or items and make predictions or recommendations according to those calculated similarity values. The so-called memory-based CF methods (Section 3) are notably deployed in commercial systems such as http://www.amazon.com/ (see an example in Figure 1) and Barnes and Noble, because they are easy to implement and highly effective [6, 7]. Customization of CF systems for each user decreases the search effort for users. It also promises greater customer loyalty, higher sales, more advertising revenue, and the benefit of targeted promotions [8].

However, there are several limitations of memory-based CF techniques, such as the fact that the similarity values are based on common items and are therefore unreliable when data are sparse and the common items are few. To achieve better prediction performance and overcome the shortcomings of memory-based CF algorithms, model-based CF approaches have been investigated. Model-based CF techniques (Section 4) use the pure rating data to estimate or learn a model to make predictions [9]. The model can be a data mining or machine learning algorithm. Well-known model-based CF techniques include Bayesian belief nets (BNs) CF models [9–11], clustering CF models [12, 13], and latent semantic CF models [7]. An MDP (Markov decision process)-based CF system [14] produces a much higher profit than a system that has not deployed the recommender.

Besides collaborative filtering, content-based filtering is another important class of recommender systems. Content-based recommender systems make recommendations by analyzing the content of textual information and finding regularities in the content.
The major difference between CF and content-based recommender systems is that CF only uses the user-item ratings data to make predictions and recommendations, while content-based recommender systems rely on the features of users and items for predictions [15]. Both content-based recommender systems and CF systems have limitations. While CF systems do not explicitly incorporate feature information, content-based systems do not necessarily incorporate the information in preference similarity across individuals [8].

Hybrid CF techniques, such as the content-boosted CF algorithm [16] and Personality Diagnosis (PD) [17], combine CF and content-based techniques, hoping to avoid the limitations of either approach and thereby improve recommendation performance (Section 5). A brief overview of CF techniques is depicted in Table 2.

Figure 1: Amazon recommends products to customers by customizing CF systems.

To evaluate CF algorithms (Section 6), we need to use metrics according to the types of CF application. Instead of classification error, the most widely used evaluation metric for prediction performance of CF is Mean Absolute Error (MAE). Precision and recall are widely used metrics for ranked lists of returned items in information retrieval research. ROC sensitivity is often used as a decision support accuracy metric.

As drawing convincing conclusions from artificial data is risky, data from live experiments are more desirable for CF research. The commonly used CF databases are MovieLens [18], Jester [19], and Netflix prize data [20]. In Section 7, we give the conclusion and discussion of this work.

2. Characteristics and Challenges of Collaborative Filtering

E-commerce recommendation algorithms often operate in a challenging environment, especially for large online shopping companies like eBay and Amazon. Usually, a recommender system providing fast and accurate recommendations will attract the interest of customers and bring benefits to companies.
For CF systems, producing high-quality predictions or recommendations depends on how well they address the challenges, which are characteristics of CF tasks as well.

Table 2: Overview of collaborative filtering techniques.

Memory-based CF
  Representative techniques:
    • Neighbor-based CF (item-based/user-based CF algorithms with Pearson/vector cosine correlation)
    • Item-based/user-based top-N recommendations
  Main advantages:
    • easy implementation
    • new data can be added easily and incrementally
    • need not consider the content of the items being recommended
    • scale well with co-rated items
  Main shortcomings:
    • are dependent on human ratings
    • performance decreases when data are sparse
    • cannot recommend for new users and items
    • have limited scalability for large datasets

Model-based CF
  Representative techniques:
    • Bayesian belief nets CF
    • clustering CF
    • MDP-based CF
    • latent semantic CF
    • sparse factor analysis
    • CF using dimensionality reduction techniques, for example, SVD, PCA
  Main advantages:
    • better address the sparsity, scalability, and other problems
    • improve prediction performance
    • give an intuitive rationale for recommendations
  Main shortcomings:
    • expensive model-building
    • have trade-off between prediction performance and scalability
    • lose useful information for dimensionality reduction techniques

Hybrid recommenders
  Representative techniques:
    • content-based CF recommender, for example, Fab
    • content-boosted CF
    • hybrid CF combining memory-based and model-based CF algorithms, for example, Personality Diagnosis
  Main advantages:
    • overcome limitations of CF and content-based or other recommenders
    • improve prediction performance
    • overcome CF problems such as sparsity and gray sheep
  Main shortcomings:
    • have increased complexity and expense for implementation
    • need external information that is usually not available

2.1. Data Sparsity.

In practice, many commercial recommender systems are used to evaluate very large product sets. The user-item matrix used for collaborative filtering will thus be extremely sparse, and the performance of the predictions or recommendations of CF systems is challenged.

The data sparsity challenge appears in several situations. Specifically, the cold start problem occurs when a new user or item has just entered the system and it is difficult to find similar ones because there is not enough information (in some literature, the cold start problem is also called the new user problem or new item problem [21, 22]). New items cannot be recommended until some users rate them, and new users are unlikely to be given good recommendations because of the lack of their rating or purchase history.

Coverage can be defined as the percentage of items that the algorithm could provide recommendations for.
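As a rough illustration of the coverage definition just given, the following sketch computes the percentage of items for which a recommender could produce a prediction. The predicate can_recommend, the one-rating threshold, and the unrated item "New Release" are all hypothetical, introduced only for this example; the rating counts for the four movies follow Table 1(a).

```python
def coverage(items, can_recommend):
    """Percentage of items for which a prediction can be produced."""
    items = list(items)
    if not items:
        return 0.0
    covered = sum(1 for item in items if can_recommend(item))
    return 100.0 * covered / len(items)


# Number of users who rated each item. "New Release" is a hypothetical
# item that has just entered the system and has no ratings yet.
ratings_per_item = {
    "Shrek": 2,
    "Snow White": 3,
    "Spiderman": 3,
    "Superman": 2,
    "New Release": 0,
}

# Assumed rule: an item is predictable only if at least one user rated it.
result = coverage(ratings_per_item, lambda item: ratings_per_item[item] >= 1)
```

Under this assumed rule the unrated new item is the only one left uncovered, which is the cold start situation for items described above.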
The reduced coverage problem occurs when the number of users' ratings is very small compared with the large number of items in the system, so that the recommender system may be unable to generate recommendations for those items.

Neighbor transitivity refers to a problem with sparse databases, in which users with similar tastes may not be identified as such if they have not both rated any of the same items. This could reduce the effectiveness of a recommendation system that relies on comparing users in pairs to generate predictions.

To alleviate the data sparsity problem, many approaches have been proposed. Dimensionality reduction techniques, such as Singular Value Decomposition (SVD) [23], remove unrepresentative or insignificant users or items to reduce the dimensionalities of the user-item matrix directly. The patented Latent Semantic Indexing (LSI) used in information retrieval is based on SVD [24, 25], in which similarity between users is determined by the representation of the users in the reduced space. Goldberg et al. [3] developed Eigentaste, which applies Principal Component Analysis (PCA), a closely related factor analysis technique first described by Pearson in 1901 [26], to reduce dimensionality. However, when certain users or items are discarded, useful information for recommendations related to them may get lost and recommendation quality may be degraded [6, 27].

Hybrid CF algorithms, such as the content-boosted CF algorithm [16], are found helpful in addressing the sparsity problem, because external content information can be used to produce predictions for new users or new items. Ziegler et al. [28] proposed a hybrid collaborative filtering approach that exploits bulk taxonomic information designed for exact product classification to address the data sparsity problem of CF recommendations, based on the generation of profiles via inference of super-topic score and topic diversification. Schein et al. proposed the aspect model latent variable method for cold start recommendation, which combines both collaborative and content information in model fitting [29]. Kim and Li proposed a probabilistic model to address the cold start problem, in which items are classified into groups and predictions are made for users considering the Gaussian distribution of user ratings [30].

Model-based CF algorithms, such as TAN-ELR (tree augmented naïve Bayes optimized by extended logistic regression) [11, 31], address the sparsity problem by providing more accurate predictions for sparse data. Some new model-based CF techniques that tackle the sparsity problem include the association retrieval technique, which applies an associative retrieval framework and related spreading activation algorithms to explore transitive associations among users through their rating and purchase history [32]; maximum margin matrix factorizations (MMMF), a convex, infinite dimensional alternative to low-rank approximations and standard factor models [33, 34]; ensembles of MMMF [35]; multiple imputation-based CF approaches [36]; and imputation-boosted CF algorithms [37].
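The neighbor transitivity problem described in this section can be illustrated with a small sketch. It assumes Pearson correlation over co-rated items as the pairwise similarity (one of the measures listed in Table 2); the function name pearson_on_corated, the three users, and their 1–5 ratings are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_on_corated(ratings_u, ratings_v):
    """Pearson correlation over items both users rated; None if undefined."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return None  # too few co-rated items to compute a correlation
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return None  # constant ratings on the common items: undefined
    return num / (den_u * den_v)


# Hypothetical ratings on a 1-5 scale for items A..E.
ann = {"A": 5, "B": 1, "C": 4}
ben = {"A": 4, "B": 2, "C": 5, "D": 5, "E": 2}
cat = {"D": 5, "E": 1}

sim_ann_ben = pearson_on_corated(ann, ben)  # computed on items A, B, C
sim_ben_cat = pearson_on_corated(ben, cat)  # computed on items D, E
sim_ann_cat = pearson_on_corated(ann, cat)  # no co-rated items
```

Here ann and cat each correlate positively with ben, yet their own similarity is undefined because they share no rated items, so a purely pairwise comparison cannot relate them.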
