Scale and Translation Invariant Collaborative Filtering Systems

by Daniel Lemire
Information Retrieval (2005)

Abstract

Collaborative filtering systems are prediction algorithms over sparse data sets of user preferences. We modify a wide range of state-of-the-art collaborative filtering systems to make them scale and translation invariant and generally improve their accuracy without increasing their computational cost. Using the EachMovie and the Jester data sets, we show that learning-free constant time scale and translation invariant schemes outperform other learning-free constant time schemes by at least 3% and perform as well as expensive memory-based schemes (within 4%). Over the Jester data set, we show that a scale and translation invariant Eigentaste algorithm outperforms Eigentaste 2.0 by 20%. These results suggest that scale and translation invariance is a desirable property.

Available from www.springerlink.com

1. INTRODUCTION
To be competitive, businesses need to help clients quickly and accurately find interesting products. Designing
software for this task becomes important as on-line shopping often does away with salespersons and offers a limited
view of the products to the prospective clients. Fortunately, businesses are often gathering large amounts of data about
their clients, which makes automated recommendation systems possible. In a wider context, one of the most valuable
characteristics of the modern web is the ability to search through large amounts of dynamic data, and any process that
can support these searches is valuable to the users.
Collaborative filtering systems are recommender systems where the recommendations are based on a database of
user ratings as opposed to content-based recommender systems which are based on the characteristics of the objects
to recommend. The basic principle behind collaborative filtering is that clients must first share some information
about themselves by rating some of the products or features they know, so that, in turn, they can get accurate
recommendations. Content-based recommender systems tend to work well with objects where the content can be processed
with some convenience such as text [1, 13]. With other types of objects such as movies or books, it is not always
easy to access the content on-line, and even if possible, automated content processing is likely to be inaccurate. Also,
content-based filtering is sometimes difficult as the user may simply not have enough information about the product or
service required. Someone surfing on an e-commerce web site might not always have a specific request and the burden
is on the web site to provide an interesting recommendation. In such cases and if we can get some ratings from the
users either explicitly or implicitly, we may prefer collaborative filtering systems. In other cases where content-based
filtering is efficient, collaborative filtering may serve to help sort results.
However, one of the challenges we face is that most users rate only a few objects and thus, we have to deal with
sparse data [6]. In many information retrieval tasks, the software is faced with large sets of accurate data and specific
queries that must be matched. On the other hand, collaborative filtering has to deal with a severe lack of information
and the information available is both imprecise and inaccurate. Thus, collaborative filtering is a prediction rather than
a search problem.
From an algorithmic point of view, it is convenient to classify collaborative filtering algorithms into three classes
depending on their query and update costs: learning-free, memory-based and model-based. Obviously, there might
be many types of operations that could be described as an update or a query, but we focus our attention on adding
a user and its ratings to a database (update) or asking for a prediction of all ratings for a given user (query). We
say that an operation whose complexity is independent of the number of users offers constant-time performance
(with respect to the number of users). Essentially, the cheapest schemes are described as learning-free and have both
constant-time updates and queries while schemes involving a comparison with users in the database are classified as
memory-based and offer constant-time updates but linear-time queries, and finally the schemes requiring more than
linear time learning or more sophisticated updates are said to be model-based (see Tab. 1). There are schemes that
would not fit in any one of these three classes of algorithms and others that would fit in more than one class.
Key words and phrases. Recommender System, Incomplete Vectors, Regression, e-Commerce.
To appear in Journal of Information Retrieval. This document may differ from the article published by Kluwer, please refer to it when available.
NRC 46508.
                 update      query     learning
learning-free    O(1)        O(1)      O(m)
memory-based     O(1)        O(m)      No
model-based      Variable    O(1)      Variable
TABLE 1. Typical complexities with respect to the number of users m of some classes of collaborative
filtering algorithms.
Typically, learning-free schemes are derived from vectors $\{v^{(k)}\}$ that are computed in linear time irrespective of the
current user and the prediction is written as
\[
\mathrm{Prediction}(u) = \sum_{k=0}^{N} \beta_k(u)\, v^{(k)}
\]
where the result of the predictor is itself a vector where each component is the rating corresponding to an item. For
example, the simplest learning-free scheme is obtained when $N = 0$, $v^{(0)} = \mathbf{1}$ where $\mathbf{1} = (1, \ldots, 1)$ and
$\mathrm{Prediction}(u) = \bar{u}\,\mathbf{1}$, that is, $\beta_0(u) = \bar{u}$ where $\bar{u}$ is the average over the known ratings of $u$.
Another such scheme is obtained when $N = 0$, $v^{(0)}_k$ is the average rating received by item number $k$, and
$\mathrm{Prediction}(u) = v^{(0)}$.
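To make the two examples above concrete, the following Python sketch implements both learning-free predictors. It is an illustration only: the representation of a user as a list with None for unrated items, and the names `user_average`, `predict_user_average`, `ItemAverages` and `add_user`, are assumptions of this sketch and not part of the schemes as published.

```python
from typing import List, Optional

# Assumption of this sketch: a user is a list with one slot per item,
# and None marks an item the user has not rated.
Ratings = List[Optional[float]]


def user_average(u: Ratings) -> float:
    """Average over the known ratings of u (the coefficient beta_0(u) when v^(0) = 1)."""
    known = [r for r in u if r is not None]
    if not known:
        raise ValueError("user has no known ratings")
    return sum(known) / len(known)


def predict_user_average(u: Ratings) -> List[float]:
    """N = 0 and v^(0) = (1, ..., 1): every item is predicted at the user's average."""
    return [user_average(u)] * len(u)


class ItemAverages:
    """N = 0 and v^(0)_k = average rating received by item k (hypothetical helper)."""

    def __init__(self, n_items: int) -> None:
        self.totals = [0.0] * n_items
        self.counts = [0] * n_items

    def add_user(self, w: Ratings) -> None:
        """Update: the cost depends on the number of items, not on the number of users."""
        for k, r in enumerate(w):
            if r is not None:
                self.totals[k] += r
                self.counts[k] += 1

    def predict(self) -> List[Optional[float]]:
        """Query: returns v^(0) for any user; None where nobody has rated the item yet."""
        return [t / c if c else None for t, c in zip(self.totals, self.counts)]


# Toy database of three users and three items (made-up values).
database = [[4, None, 2], [2, 5, None], [None, 3, 4]]
model = ItemAverages(n_items=3)
for w in database:
    model.add_user(w)

print(model.predict())                      # per-item averages
print(predict_user_average([4, None, 2]))   # per-user average repeated for each item
```

Keeping running totals and counts is one simple way to obtain updates and queries whose cost does not grow with the number of users; other bookkeeping choices would work equally well.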
Memory-based collaborative filtering systems usually compute weighted averages over ratings already in the database
where the weights are given by a correlation measure [3, 12] or any similar measure [17] including probabilistic
ones [10]. Generally, we can write a memory-based prediction as
\[
\mathrm{Prediction}(u) = F(u) + \sum_{w} \omega(w,u)\, w
\]
where $F(u)$ is a learning-free prediction and where the sum is over all users in the database with $\omega(w,u)$ some
measure of similarity between $w$ and $u$. Because not all users have rated all items, the sum can be different for each
item and we will make this point precise later. As there is little precomputation, updates to the database are fast, but
queries tend to be slow as we need to match the current user against the entire database each time. Memory-based
systems can outperform a wide range of model-based systems [3, 10] and accordingly, they are often used as reference
collaborative filtering systems for benchmarking purposes. The main drawback of memory-based schemes is their lack
of scalability. Some authors have proposed selecting the most representative or useful users from the database [18, 19],
making memory-based systems more balanced in terms of update and query performance while preserving and even
slightly increasing the accuracy. However, unlike learning-free and model-based schemes, memory-based systems
require access to a database at all times and thus there are privacy issues [4] and a memory-based system cannot run
conveniently on devices with very limited storage.
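The generic memory-based formula above can be sketched in Python as follows. Since the per-item treatment of missing ratings is only made precise later, this sketch fills the gap with common choices that are assumptions here: the learning-free term is the per-user average, the similarity is the Pearson correlation over co-rated items, each neighbor contributes its deviation from its own average, and the sum for an item runs only over users who rated it and is normalized by the sum of the absolute weights. All function names are hypothetical.

```python
import math
from typing import List, Optional

Ratings = List[Optional[float]]  # None marks an unrated item (assumption of this sketch)


def mean_known(u: Ratings) -> float:
    """Average over the known ratings; assumes at least one rating is present."""
    known = [r for r in u if r is not None]
    return sum(known) / len(known)


def pearson(w: Ratings, u: Ratings) -> float:
    """Pearson correlation over items rated by both users; 0.0 when undefined."""
    common = [(a, b) for a, b in zip(w, u) if a is not None and b is not None]
    if len(common) < 2:
        return 0.0
    mean_w = sum(a for a, _ in common) / len(common)
    mean_u = sum(b for _, b in common) / len(common)
    cov = sum((a - mean_w) * (b - mean_u) for a, b in common)
    var_w = sum((a - mean_w) ** 2 for a, _ in common)
    var_u = sum((b - mean_u) ** 2 for _, b in common)
    if var_w == 0 or var_u == 0:
        return 0.0
    return cov / math.sqrt(var_w * var_u)


def predict_memory_based(u: Ratings, database: List[Ratings]) -> List[float]:
    """Per-user average plus a similarity-weighted sum of neighbor deviations."""
    base = mean_known(u)
    # One similarity per stored user: this is the linear-time part of the query.
    weights = [pearson(w, u) for w in database]
    means = [mean_known(w) for w in database]
    prediction = []
    for k in range(len(u)):
        num = 0.0
        den = 0.0
        for w, omega, mean_w in zip(database, weights, means):
            if w[k] is not None:  # only users who rated item k take part in the sum
                num += omega * (w[k] - mean_w)
                den += abs(omega)
        prediction.append(base + num / den if den > 0 else base)
    return prediction


# Toy data (made-up values): the third item of the current user is unknown.
database = [[5, 3, 5], [1, 3, 1]]
current_user = [5, 3, None]
print(predict_memory_based(current_user, database))
```

Every query scans the whole database to compute the weights, which is why such schemes answer queries in time linear in the number of users, while adding a user amounts to appending one list.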
If all possible preference sets were equally likely, no prediction would be possible. Since predictions have been
shown to be reliable [3], there must be many hidden constraints and few remaining degrees of freedom,
which suggests making predictions based on a model. Model-based collaborative filtering systems extract from the
database some key parameters and do not use the database directly to answer queries. Examples include Principal
Components Analysis (PCA) [7], Factor Analysis [4], Singular Value Decomposition [5, 15], Bayesian Networks [3],
Item-Based models [16, 14] and Neural Networks [2]. Model-based systems tend to answer queries fast, most often
in constant time with respect to the number of users, but also run potentially expensive learning routines and are often
static in nature: updating the database can be expensive as it may require up to a completely new learning phase.
Another possible drawback is that most model-based systems assume a large database is available whereas we would
like collaborative filtering to work in a wide range of contexts.
One can test the accuracy of an algorithm by applying it on data where some of the ratings have been hidden.
While results vary depending on the data set and the experimental protocol, most published collaborative filtering
algorithms have similar prediction accuracies. For example, with the EachMovie data set, the accuracy improvement
in going from a naive prediction based on the per-item average (learning-free) to a sophisticated Factor Analysis approach
is no more than 17% [4]. Similarly, extensive work has been done to improve the Pearson correlation approach
[3, 8] and yet, accuracy improvements do not exceed 20%. The differences between inexpensive schemes and more
sophisticated ones are even smaller when one upgrades a simple averaging scheme to the Bias From Mean algorithm
introduced by Herlocker et al. [8]. In the results presented in this paper, the difference between the best and the
worst scheme is of the order of 33% irrespective of the data set. In this context, systematic improvements by small
percentages are significant.
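The testing idea described above can be sketched as follows. The protocol used here (hide the last known rating of each test user, predict it from the remaining ones, and report the mean absolute error) is an assumption chosen for brevity, since the text does not fix a protocol at this point. The names `hide_one_mae`, `per_user_average` and `relative_improvement` are hypothetical.

```python
from typing import Callable, List, Optional

Ratings = List[Optional[float]]  # None marks an unrated item (assumption of this sketch)
Predictor = Callable[[Ratings], List[float]]


def per_user_average(u: Ratings) -> List[float]:
    """Learning-free baseline: predict the user's average rating for every item."""
    known = [r for r in u if r is not None]
    return [sum(known) / len(known)] * len(u)


def hide_one_mae(test_users: List[Ratings], predict: Predictor) -> float:
    """Hide one known rating per user, predict it, and average the absolute errors."""
    errors = []
    for u in test_users:
        rated = [k for k, r in enumerate(u) if r is not None]
        if len(rated) < 2:
            continue  # nothing would be left to predict from
        hidden = rated[-1]
        visible = list(u)
        visible[hidden] = None
        guess = predict(visible)[hidden]
        errors.append(abs(guess - u[hidden]))
    if not errors:
        raise ValueError("no test user has at least two ratings")
    return sum(errors) / len(errors)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error improves on baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy test set (made-up values).
test_users = [[4, 2, None, 3], [1, None, 5, 5]]
mae = hide_one_mae(test_users, per_user_average)
print(mae)
print(round(relative_improvement(1.0, 0.83), 1))
```

Percentages such as the ones quoted above can then be read as the relative improvement of one scheme's error over another's under a fixed protocol.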
