Motion-based finger tracking for user interaction with mobile devices
Visual Media Production 2007 IETCVMP 4th European Conference on (2007)
- ISBN: 9780863418
- DOI: 10.1049/cp:20070038
MOTION-BASED FINGER TRACKING FOR USER
INTERACTION WITH MOBILE DEVICES
Jari Hannuksela, Sami Huttunen, Pekka Sangi and Janne Heikkilä
Machine Vision Group, Infotech Oulu
Department of Electrical and Information Engineering
P.O. Box 4500, FIN-90014 University of Oulu, Finland
{firstname.lastname}@ee.oulu.fi
Keywords: Motion features, Kalman filter, EM-algorithm,
Motion estimation
Abstract
A new motion-based tracking algorithm for user interaction
with hand-held mobile devices is presented. The idea is to
allow mobile phone users to control the device simply by
moving a finger in front of the camera. A novel combination of
Kalman filtering and the Expectation Maximization (EM)
algorithm is used to estimate two distinct motion
components corresponding to the camera motion and the
finger motion. The estimation is based on motion features,
which are effectively extracted from the scene for each image
frame. The performance of the technique is evaluated in
experiments which show the usefulness of the approach.
The method can also be applied when conventional
finger tracking techniques such as color segmentation and
background subtraction cannot be used.
1 Introduction
Interaction with hand-held mobile devices has become a part
of our everyday life. Although the performance and capabilities
of devices such as mobile phones have increased significantly,
user interfaces are still largely based on small displays
and keypads. Using unergonomic keys for applications other
than calling can be cumbersome because the number of buttons
is limited and several key presses may be required for the
desired outcome. Various new and more sophisticated
technologies have been introduced to consumers to improve
control, for example touch screens, voice recognition and
motion sensors [7]. However, touch screens are expensive and
require both hands for operation, and voice recognition often
lacks reliability. Motion sensors require extra hardware to be
installed. Therefore, it is interesting to consider other
approaches to interaction as well.
The increasing availability of mobile phones with a built-in
high-resolution camera and decent computational power has made
it possible to use computer vision as an alternative solution.
Recently, the motion estimated from successive images has been
used to browse and navigate images on the display [4, 6]. These
methods allow users to control the device by tilting and moving
it in the hand. Furthermore, motion input combined with
suitable pattern recognition techniques can be applied to more
advanced interaction such as recognizing handwriting, gestures
and signs [5, 12]. However, instead of moving the device,
another intuitive way to interact is to move a finger in front
of the camera.
Vision-based finger tracking is a well-studied problem with
numerous applications. Crowley et al. [1] introduced the
FingerPaint system to track pointing devices for a digital
desk. A finger was tracked using cross-correlation with a
reference template. Quek et al. [10] presented FingerMouse,
which uses color segmentation to detect the finger and track
the fingertip. Jin et al. [8] proposed a finger-writing
character recognition system. They used background subtraction
to segment the finger from a cluttered background and detected
the fingertip based on feature matching. Dominguez et al. [3]
presented a system that uses color information to segment
skin-like regions, shape analysis to detect the fingertip, and
Kalman filtering to track it.
Several techniques already exist for finger tracking, but these
are usually developed for static cameras, assume controlled
lighting conditions, use special markers, or need too much
computing power to be used in mobile phones. For
example, skin color segmentation is unreliable in a mobile
environment where lighting conditions change continually.
Also, methods based on background subtraction are not
applicable when the camera is moving and the background is
complex. Therefore, we propose a new method that utilizes
effective motion feature extraction for each frame in the image
sequence. These features are used to distinguish the finger
motion from the background motion. The finger can also be
replaced with some other object such as a pen, since we do not
use any shape or color analysis.
In addition, we present a novel method to estimate and track
two distinct motions. The method combines the Kalman filtering
algorithm [9] and the expectation maximization (EM)
algorithm [2] to estimate the parameters of two motion
components using motion features as measurements. To the best
of the authors' knowledge, this has not been done before.
Although the Kalman filter and the EM algorithm have been
combined to solve similar problems earlier, for example in
[13], the EM algorithm has not been embedded into the Kalman
filter stages as in our case. The benefit of this approach is
that no iterations
are needed, which makes the algorithm computationally more
efficient.
The remainder of the paper is organized as follows. Section 2
describes the tracking algorithm in detail, and experimental
results with real image sequences are shown in Section 3.
Finally, Section 4 concludes the paper and discusses some
future work.
2 Tracking algorithm
The problem with finger tracking in mobile devices is that the
camera is usually moving, and the background in the images is
not static. Therefore, background subtraction cannot easily be
applied to extract the finger region. This leads us to consider
a motion-based solution, where two distinct motions
corresponding to the background (camera) motion and the
foreground (camera + finger/hand) motion are estimated in a
tracking framework. Then, the background motion is subtracted
from the foreground motion in order to obtain the movement of
interest. Integrating this motion over time provides the
position and trajectory information required by the
application.
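
As a rough illustration of the last two sentences, the sketch below
subtracts a per-frame background motion from a per-frame foreground
motion and accumulates the difference into a trajectory. The function
name `finger_trajectory` and the sample motion values are hypothetical
and are not taken from the paper.

```python
def finger_trajectory(background, foreground, start=(0.0, 0.0)):
    """Integrate (foreground - background) motion into a list of positions.

    background, foreground: per-frame (u, v) motion estimates in pixels.
    """
    x, y = start
    path = [(x, y)]
    for (ub, vb), (uf, vf) in zip(background, foreground):
        # Movement of interest: foreground motion with camera motion removed.
        x += uf - ub
        y += vf - vb
        path.append((x, y))
    return path


# Hypothetical motion estimates for three frame pairs.
camera = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
camera_plus_finger = [(3.0, 0.0), (1.0, 2.0), (0.0, 1.0)]
print(finger_trajectory(camera, camera_plus_finger))
```

In the last frame pair both motions are equal, so the integrated
position does not change: only motion relative to the background moves
the pointer.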
The models of the background and foreground motions are
based on the assumption that the motions are constant but
subject to random perturbations. Translational models are
considered sufficient approximations, and the dynamical model
of the background ($j = 1$) and foreground ($j = 2$) motions is
formulated as

$$\mathbf{x}_j(k+1) = \mathbf{x}_j(k) + \boldsymbol{\varepsilon}_j(k), \tag{1}$$

where $\mathbf{x}_j(k) = [u_j(k), v_j(k)]^T$ denotes the motion
between the frames $k-1$ and $k$, and
$\boldsymbol{\varepsilon}_j(k)$ is the process noise term,
which is assumed to be zero-mean white Gaussian with covariance
matrix $\mathbf{Q}_j = \sigma_j^2 \mathbf{I}$. As the foreground
motion contains both camera and finger motion, it is reasonable
to assume that $\sigma_2^2 > \sigma_1^2$.
Observation $i$ of the displacement, $\mathbf{d}_i$, is assumed
to follow the measurement model

$$\mathbf{d}_i(k) = \lambda_i \mathbf{x}_1(k) + (1 - \lambda_i)\,\mathbf{x}_2(k) + \boldsymbol{\eta}_i(k), \tag{2}$$

where $\boldsymbol{\eta}_i(k)$ is the observation noise term,
which is assumed to obey a zero-mean Gaussian distribution with
covariance $\mathbf{R}_i$, and $\lambda_i$ is a hidden
binary-valued assignment variable with the value 1 for the
background and 0 for the foreground motion.
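
To make the two models concrete, the following minimal sketch draws
synthetic data from them: a random-walk update for each motion as in
(1), and an observation that measures exactly one of the two motions as
in (2). The helper names `step_motion` and `observe`, the numeric
values, and the use of isotropic observation noise with a single
standard deviation are illustrative assumptions; the paper allows a
general covariance for each observation.

```python
import random


def step_motion(x, sigma, rng):
    """Model (1): previous motion plus zero-mean Gaussian perturbation."""
    return (x[0] + rng.gauss(0.0, sigma), x[1] + rng.gauss(0.0, sigma))


def observe(x_bg, x_fg, is_background, noise_std, rng):
    """Model (2): the feature measures one of the two motions, plus noise.

    The binary assignment is 1 for the background and 0 for the foreground.
    Isotropic noise is an assumption made to keep the sketch short.
    """
    lam = 1 if is_background else 0
    u = lam * x_bg[0] + (1 - lam) * x_fg[0] + rng.gauss(0.0, noise_std)
    v = lam * x_bg[1] + (1 - lam) * x_fg[1] + rng.gauss(0.0, noise_std)
    return (u, v)


rng = random.Random(0)
x_bg, x_fg = (1.0, 0.0), (4.0, -2.0)      # hypothetical motions in pixels
x_bg = step_motion(x_bg, 0.1, rng)        # smaller process noise (background)
x_fg = step_motion(x_fg, 0.5, rng)        # larger process noise (foreground)
features = [observe(x_bg, x_fg, i % 2 == 0, 0.2, rng) for i in range(6)]
print(features)
```

A tracker only sees `features`; which motion generated each one is
hidden and has to be inferred.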
Motion observations (between frames $k-1$ and $k$) are based
on a set of block displacement estimates, which are obtained
through evaluation of the zero-mean sum of squared differences
(ZSSD) criterion. First, $N$ image blocks are selected from the
anchor frame $k-1$ based on the gradient information. For each
block $i$ ($i = 1, \ldots, N$), ZSSD is evaluated for some
range of block displacements. Analysis of the ZSSD values [11]
provides displacement estimates $\mathbf{d}_i = [u_i, v_i]^T$
and related uncertainty information, which is represented as a
$2 \times 2$ covariance matrix $\mathbf{C}_i$. As a result, we
get a set of motion features $F_i = (\mathbf{d}_i, \mathbf{C}_i)$
as illustrated in Figure 1.
Figure 1: Motion features. Estimates of feature block
displacements (lines) and associated error covariances
(visualized using error ellipses).
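
The ZSSD criterion mentioned above can be sketched as follows. The
code performs a plain exhaustive search over integer displacements and
returns the one with the smallest ZSSD value. It does not reproduce the
gradient-based block selection or the covariance analysis of [11], and
the function names `zssd`, `crop` and `best_displacement` are
hypothetical.

```python
def zssd(block_a, block_b):
    """Zero-mean sum of squared differences of two equally sized blocks."""
    flat_a = [p for row in block_a for p in row]
    flat_b = [p for row in block_b for p in row]
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    # Removing each block's mean makes the score insensitive to a
    # uniform brightness change between the two frames.
    return sum(((a - mean_a) - (b - mean_b)) ** 2
               for a, b in zip(flat_a, flat_b))


def crop(image, top, left, size):
    """Return the size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def best_displacement(anchor, target, top, left, size, search):
    """Search displacements (u, v) with |u|, |v| <= search.

    u is the horizontal (column) shift, v the vertical (row) shift.
    Returns ((u, v), score) for the lowest ZSSD score found.
    """
    ref = crop(anchor, top, left, size)
    best = None
    for v in range(-search, search + 1):
        for u in range(-search, search + 1):
            t, l = top + v, left + u
            if (t < 0 or l < 0 or t + size > len(target)
                    or l + size > len(target[0])):
                continue  # candidate block falls outside the image
            score = zssd(ref, crop(target, t, l, size))
            if best is None or score < best[0]:
                best = (score, (u, v))
    return best[1], best[0]
```

Because the block means are subtracted, a block that has moved and
become uniformly brighter still matches with a score of zero.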
Assuming that the assignment variables $\lambda_i$ are known,
Kalman filtering [9] could be used to obtain optimal estimates
of $\mathbf{x}_j(k)$. The algorithm estimates the state in two
stages, prediction and correction. In the first stage, the
state of the system at the next time instant is predicted based
on the previous filtered state estimate and the dynamical model
of the system. In the correction stage, the predicted state is
corrected using the measurement information.
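
The two stages can be sketched for the case where the assignment of a
measurement is known. To keep the example short, it assumes that all
covariances are isotropic (a scalar variance times the identity), so
the two motion coordinates share one scalar gain. This simplification
and the names `predict` and `correct` are assumptions of the sketch,
not the formulation used in the paper.

```python
def predict(x, p, q):
    """Prediction: the random-walk model keeps the state and adds variance q."""
    return x, p + q


def correct(x_pred, p_pred, d, r):
    """Correction with one displacement measurement d of variance r."""
    gain = p_pred / (p_pred + r)  # scalar Kalman gain for the isotropic case
    x = (x_pred[0] + gain * (d[0] - x_pred[0]),
         x_pred[1] + gain * (d[1] - x_pred[1]))
    return x, (1.0 - gain) * p_pred


# Hypothetical numbers: prior motion (0, 0) with variance 1, process
# variance 1, and a measurement (4, 2) with variance 2.
x_pred, p_pred = predict((0.0, 0.0), 1.0, 1.0)
x_filt, p_filt = correct(x_pred, p_pred, (4.0, 2.0), 2.0)
print(x_filt, p_filt)
```

With equal predicted and measurement variances the gain is 0.5, so the
corrected estimate lies halfway between the prediction and the
measurement.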
In our case, the assignments are not known and need to be
estimated. To do this, the idea is to embed the
EM algorithm [2] into the Kalman filter stages. The basic
assumption is that the motion measurements $\mathbf{d}_i$ are
drawn from either of two distributions corresponding to the
background or the foreground. Given some estimate of the
distribution parameters, we can evaluate the measurements
against those distributions to obtain weights $w_{i,j}$, which
are soft assignment variables in the range [0, 1]. The
prediction step of the Kalman filter is used to provide the
estimates needed. Figure 2 shows an example of how the motion
features presented in Figure 1 are assigned to different
components.
To describe the algorithm in more detail, let us denote the
filtered estimate of the state $\mathbf{x}_j(k)$ by
$\hat{\mathbf{x}}_j^+(k)$ and the associated estimation error
covariance matrix by $\mathbf{P}_j^+(k)$. The steps used to
obtain the filtered estimate of the state at time instant
$k+1$ are:
1. Applying dynamics (1), the predicted estimate
$\hat{\mathbf{x}}_j^-(k+1)$ and the prediction error covariance
matrix $\mathbf{P}_j^-(k+1)$ are given by

$$\hat{\mathbf{x}}_j^-(k+1) = \hat{\mathbf{x}}_j^+(k) \tag{3}$$

and

$$\mathbf{P}_j^-(k+1) = \mathbf{P}_j^+(k) + \mathbf{Q}_j. \tag{4}$$

2. For each motion feature $F_i$, the weights $w_{i,j}$ are
computed using a Bayesian formulation. Let $\pi_j(k) > 0$ be
the a priori probability of associating a feature with the
motion $j$ ($\sum_j \pi_j(k) = 1$). The weight $w_{i,j}$ is the
a posteriori probability given by ($\sum_j w_{i,j} = 1$)

$$w_{i,j} \propto p\big(\mathbf{d}_i \mid \hat{\mathbf{x}}_j^-(k+1),\; \mathbf{P}_j^-(k+1) + \mathbf{C}_i\big)\, \pi_j(k), \tag{5}$$
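
Steps 1 and 2 can be sketched as below. The code assumes that the
density in (5) is a two-dimensional Gaussian centred on the predicted
motion, with the prediction error covariance and the feature covariance
added together, as the text suggests. The helper names and the example
numbers are hypothetical, and the sketch does not guard against both
densities underflowing to zero.

```python
import math


def add(m1, m2):
    """Sum of two 2x2 matrices given as nested lists."""
    return [[m1[r][c] + m2[r][c] for c in range(2)] for r in range(2)]


def gaussian_pdf(d, mean, cov):
    """Density of a 2-D Gaussian with symmetric covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = d[0] - mean[0], d[1] - mean[1]
    maha = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def predict(x_filt, p_filt, q):
    """Step 1: the state is carried over and the process covariance is added."""
    return x_filt, add(p_filt, q)


def weights(d, c_i, x_pred, p_pred, priors):
    """Step 2: posterior weights of one feature for motions j = 1, 2."""
    raw = [priors[j] * gaussian_pdf(d, x_pred[j], add(p_pred[j], c_i))
           for j in range(2)]
    total = sum(raw)
    return [w / total for w in raw]


identity = [[1.0, 0.0], [0.0, 1.0]]
x_pred, p_pred = [], []
for x_filt in [(0.0, 0.0), (4.0, 0.0)]:   # hypothetical background, foreground
    x, p = predict(x_filt, identity, identity)
    x_pred.append(x)
    p_pred.append(p)
print(weights((0.5, 0.0), identity, x_pred, p_pred, [0.5, 0.5]))
```

A feature whose displacement is close to the predicted background
motion receives most of its weight for the first component, and the two
weights always sum to one.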