GaZIR: gaze-based zooming interface for image retrieval

by Laszlo Kozma, Arto Klami, Samuel Kaski
Proceedings of the 2009 International Conference on Multimodal Interfaces (ACM), poster session: Multimodal applications and techniques (2009)

Abstract

We introduce GaZIR, a gaze-based interface for browsing and searching for images. The system computes on-line predictions of relevance of images based on implicit feedback, and when the user zooms in, the images predicted to be the most relevant are brought out. The key novelty is that the relevance feedback is inferred from implicit cues obtained in real-time from the gaze pattern, using an estimator learned during a separate training phase. The natural zooming interface can be connected to any content-based information retrieval engine operating on user feedback. We show with experiments on one engine that there is a sufficient amount of information in the gaze patterns to make the estimated relevance feedback a viable choice to complement or even replace explicit feedback by pointing-and-clicking.

László Kozma, Arto Klami, Samuel Kaski
laszlo.kozma@tkk.fi, arto.klami@tkk.fi, samuel.kaski@tkk.fi
Helsinki Institute for Information Technology HIIT,
Department of Information and Computer Science,
Helsinki University of Technology
Finland
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Relevance feedback, Search process; H.5.2 [Information Interfaces and Representation]: User interfaces - Input devices and strategies (e.g., mouse, touchscreen)
General Terms
Algorithms, Experimentation, Performance
Keywords
Gaze-based interface, image retrieval, implicit feedback, zooming interface
1. INTRODUCTION
In recent years image retrieval techniques operating on meta-data, such as textual annotations or user-specified tags, have become the industry standard for retrieval from large image collections. They work well with sufficiently high-quality meta-data, but the need for more content-based approaches operating on low-level features extracted from the image content is still apparent. Content-based techniques are useful for refining results of keyword searches and, moreover, the available meta-data may not be sufficiently rich for all queries.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMI-MLMI’09, November 2–4, 2009, Cambridge, MA, USA.
Copyright 2009 ACM 978-1-60558-772-1/09/11 ...$10.00.
In content-based image retrieval (CBIR) there has been a lot of research on retrieval accuracy, developing better feature descriptions, improving the actual retrieval engines, and refining evaluation metrics, resulting in search engines [3]. To focus the search, the engines typically collect explicit feedback from the user about which of the shown images are relevant.
We study whether it would be possible to make the interface between the user and the search engine more fluent and natural by collecting the feedback implicitly from what the user would do in any case. We separate explicit control and implicit feedback, and make the former intuitive to exercise and the latter as informative as possible. In brief, the user explicitly requests more (better) images by zooming in, and the implicit feedback is inferred from gaze tracking data while the user looks at the images. This paper is a feasibility study on whether it is possible to construct such an interface, and whether it works in practice with an existing CBIR engine.
The main research question is to what extent explicit relevance feedback can be augmented or eventually replaced by implicit relevance feedback inferred from the actions the user would perform in any case, the idea being that removing a separate relevance feedback phase will make the interface more natural and faster to use [7]. As a practical information source, we use cues obtained by measuring the eye movements of the user, following the success of earlier attempts at inferring relevance from eye movements in text retrieval [2, 4, 11]. As far as we know, there have so far been only preliminary studies on the use of implicit gaze information in image retrieval [8, 10]. Oyekoya et al. present a simple retrieval system that infers relevance from straightforward viewing time [10], whereas Klami et al. introduce a more complex relevance predictor but only measure isolated prediction performance in a simplified artificial setup [8]. We combine the two approaches, developing an even more sophisticated relevance predictor and integrating it with a real retrieval engine. Furthermore, we design the user interface specifically for gaze-based interaction.
Eye movements as a source of implicit relevance feedback have three major advantages. First, the user will by definition need to look at the images in order to decide on relevance, and hence if relevance feedback can be inferred from eye movements it will be completely effortless for the user. The user just “looks at the images” as they normally would. Second, the rich implicit feedback from eye movements may help in the extremely hard problem of solving “I will know it when I see it” type search tasks, where the goal is ill-defined at best. Such tasks cannot be solved even with meta-data if the user is not able to formulate explicit queries. Third, with suitable hardware gaze tracking is usable in mobile settings when the hands cannot be used, and for users with motor disabilities [1, 16]. While commercial gaze trackers are not yet widespread, recent developments [14] suggest that low-cost, robust eye tracking will be possible in the near future also in standard desktop and mobile devices.
Gaze is used to explicitly guide the interface in Dasher, a system for gaze-based text entry [16] which has been one source of inspiration for our work. In Dasher a language model offers choices for the next letters to be typed, with the size of the letters on the display being proportional to their predicted likelihood of being selected. The user then looks at the next letter in a zooming interface where new letters appear at a speed that is also controlled by gaze. In our case the letters correspond to images, of which the ones predicted to be most relevant are shown, and the user scrolls to get more images. Most of the other features of the systems are different, however, most notably the explicit vs. implicit use of gaze for feedback. In explicit gaze-driven setups, care has to be exercised to avoid the “Midas touch” effect: it is tiring to use the eyes explicitly as control devices for long because everything you look at will be selected [6]. Implicit feedback should not suffer from the same problem: the intent is not that the user controls the system with the eyes, but that information is extracted from natural eye movements.
Several techniques have been proposed for visualization and navigation of large image collections, including methods like zooming and other distortions for displaying the contents, and tree- or cone-like structures for organizing the image collection. A comprehensive review of visual interfaces can be found in [17]. Our interface borrows elements from this body of research, the main goal and novelty being in facilitating the interaction with gaze. The remaining visualization decisions were made to create a simple and intuitive interface.
In the remainder of the paper, we first describe the interface and how it interacts with the gaze tracker and the retrieval engine. Then we explain how the relevance of images is predicted from the gaze tracking measurements, and demonstrate with user experiments that the accuracy of the relevance predictions is relatively high. Finally, we perform preliminary experiments on actual image retrieval accuracy using the learned relevance predictor. The approach is shown to have promising performance with high accuracy in certain kinds of search tasks. Work still remains, however; in particular, the performance is not high for all users and search tasks.
2. GAZIR
2.1 Interface
The browsing interface is designed to elicit and collect the maximal amount of information from gaze while still being a natural interface for browsing the image collection. Figure 1 illustrates the interface, showing three concentric rings of images. The outermost ring contains the first ten images shown to the user, the second ring shows images retrieved given the relevance feedback collected from the outermost ring, and the innermost ring takes into account feedback from the two previous rings. The user can zoom the interface inwards and outwards. When zooming inwards the system retrieves another set of images, using all the previous images and their estimated relevances as feedback, and eventually the older rings disappear from the display. They can, however, be recalled by zooming out, and the retrieval process can be restarted from any stage by erasing the rings inside the current main ring.
Figure 1: Screenshot of the GaZIR interface. Relevance feedback gathered from outer rings influences the images retrieved for the inner rings, and the user can zoom in to reveal more rings.
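The ring bookkeeping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `RingSession`, the engine signature, the stand-in `toy_engine`, and the choice of ten images for every ring (the paper states ten only for the outermost ring) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical engine signature: takes {image_id: estimated relevance} for all
# images shown so far plus a ring size, and returns the ids of the next ring.
Engine = Callable[[Dict[str, float], int], List[str]]

RING_SIZE = 10  # assumption: every ring holds ten images


@dataclass
class RingSession:
    engine: Engine
    rings: List[List[str]] = field(default_factory=list)      # index 0 = outermost
    relevance: Dict[str, float] = field(default_factory=dict)

    def update_relevance(self, image_id: str, score: float) -> None:
        """Store the latest relevance estimate (e.g. from a gaze-based predictor)."""
        self.relevance[image_id] = score

    def zoom_in(self) -> List[str]:
        """Fetch a new innermost ring using feedback on all earlier rings."""
        feedback = {img: self.relevance.get(img, 0.0)
                    for ring in self.rings for img in ring}
        new_ring = self.engine(feedback, RING_SIZE)
        self.rings.append(new_ring)
        return new_ring

    def restart_from(self, ring_index: int) -> None:
        """Erase every ring inside ring `ring_index` and forget its feedback."""
        dropped = self.rings[ring_index + 1:]
        self.rings = self.rings[:ring_index + 1]
        for ring in dropped:
            for img in ring:
                self.relevance.pop(img, None)


def toy_engine(feedback: Dict[str, float], k: int) -> List[str]:
    """Stand-in engine: returns k unseen ids and ignores the relevance values."""
    start = len(feedback)
    return [f"img{start + i}" for i in range(k)]
```

Calling `zoom_in()` three times on `RingSession(toy_engine)` yields the three rings of Figure 1; `restart_from(0)` then discards the two inner rings so that retrieval can continue from the outermost one. A real CBIR engine would replace `toy_engine` and actually use the relevance values.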
The concentric rings of images were chosen instead of the standard grid-based thumbnail display of most image retrieval interfaces, in order to avoid imposing gaze trajectories based on the structure of the display instead of the content. On a standard grid the users are likely to go through the images in a row-by-row manner, considerably lowering the amount of relevance information the eye movements contain. Completely random placement of images would break this pattern optimally, but a user is likely to find such an interface unpleasant to use. A circle of images provides a compromise between these two goals. It does not lead to scanning patterns as strongly fixed as a grid would, allowing image content to play a bigger role in determining where to look, yet it is sufficiently close to standard user interfaces to feel intuitive.
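As a rough illustration of the circular layout, the following sketch places images evenly on concentric rings. The function names, the equal angular spacing, and the geometric shrink factor between rings are assumptions; the paper does not specify how the ring geometry is computed.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def ring_positions(n_images: int, radius: float,
                   center: Point = (0.0, 0.0),
                   phase: float = 0.0) -> List[Point]:
    """Centers of n_images thumbnails spaced evenly on one circle."""
    cx, cy = center
    step = 2.0 * math.pi / n_images
    return [(cx + radius * math.cos(phase + i * step),
             cy + radius * math.sin(phase + i * step))
            for i in range(n_images)]


def concentric_layout(n_rings: int, n_images: int, outer_radius: float,
                      shrink: float = 0.6) -> List[List[Point]]:
    """Positions for n_rings rings; ring 0 is outermost (shrink is an assumed factor)."""
    return [ring_positions(n_images, outer_radius * shrink ** r)
            for r in range(n_rings)]
```

For example, `concentric_layout(3, 10, 300.0)` gives three rings of ten positions with radii 300, 180, and 108. Unlike a grid, no row or column structure exists for the eye to follow, which is the property the paragraph above argues for.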
For the purpose of learning the relevance predictor and studying the interface, we perform the experiments in this paper with two simplifications. First, the user is only expected to zoom inwards and not to reset the retrieval process at any stage. Second, the retrieval engine is set to operate in a sequential manner: a new set of images is fetched only when the user zooms in, and they are not updated afterwards. An alternative would be to continuously update the set of images on inner rings when the relevance estimates on the outer rings change. These simplifications were made so that we could collect reliable ground truth for learning the relevance predictor. Finally, in the experiments we used the mouse wheel for zooming in and out to make the gaze-based interaction completely implicit. The interface can alternatively be zoomed with explicit eye control (looking at the center zooms in and looking at the borders zooms out).
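The explicit eye-control alternative (center zooms in, borders zoom out) could look roughly like the sketch below. The function `zoom_action` and both thresholds are hypothetical; the paper gives no region sizes, and a practical system would also need dwell-time filtering to avoid accidental triggers.

```python
import math


def zoom_action(gaze_x: float, gaze_y: float, width: float, height: float,
                center_frac: float = 0.15, border_frac: float = 0.10) -> str:
    """Map a gaze point in screen pixels to 'in', 'out', or 'none'.

    center_frac and border_frac are assumed values, not taken from the paper.
    """
    # Ignore samples that fall outside the screen (e.g. tracking loss).
    if not (0 <= gaze_x <= width and 0 <= gaze_y <= height):
        return "none"

    # Looking near the screen center requests zooming in.
    cx, cy = width / 2.0, height / 2.0
    if math.hypot(gaze_x - cx, gaze_y - cy) <= center_frac * min(width, height):
        return "in"

    # Looking at the border strip requests zooming out.
    margin_x, margin_y = border_frac * width, border_frac * height
    if (gaze_x < margin_x or gaze_x > width - margin_x or
            gaze_y < margin_y or gaze_y > height - margin_y):
        return "out"

    return "none"
```

Everything between the central disc and the border strip is left neutral, so that looking at the image rings themselves never triggers a zoom.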