
Cross recurrence quantification for cover song identification

by Joan Serrà, Xavier Serra, Ralph G Andrzejak
New Journal of Physics (2009)

Abstract

There is growing evidence that nonlinear time series analysis techniques can be used to successfully characterize, classify, or process signals derived from real-world dynamics even though these are not necessarily deterministic and stationary. In the present study, we proceed in this direction by addressing an important problem our modern society is facing, the automatic classification of digital information. In particular, we address the automatic identification of cover songs, i.e. alternative renditions of a previously recorded musical piece. For this purpose, we here propose a recurrence quantification analysis measure that allows the tracking of potentially curved and disrupted traces in cross recurrence plots (CRPs). We apply this measure to CRPs constructed from the state space representation of musical descriptor time series extracted from the raw audio signal. We show that our method identifies cover songs with a higher accuracy as compared to previously published techniques. Beyond the particular application proposed here, we discuss how our approach can be useful for the characterization of a variety of signals from different scientific disciplines. We study coupled Rössler dynamics with stochastically modulated mean frequencies as one concrete example to illustrate this point.


Joan Serrà1, Xavier Serra and Ralph G Andrzejak
Department of Information and Communication Technologies,
Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain
E-mail: joan.serraj@upf.edu
New Journal of Physics 11 (2009) 093017 (20pp)
Received 22 July 2009
Published 15 September 2009
Online at http://www.njp.org/
doi:10.1088/1367-2630/11/9/093017
1 Author to whom any correspondence should be addressed.
1367-2630/09/093017+20$30.00 © IOP Publishing Ltd and Deutsche Physikalische Gesellschaft
Contents
1. Introduction
2. Method
2.1. Pre-processing
2.2. State space embedding
2.3. CRP
2.4. Recurrence quantification measures for cover song identification
3. Evaluation
3.1. Evaluation data
3.2. Evaluation methodology
4. Results
4.1. Parameter optimization
4.2. Out-of-sample accuracy
4.3. Comparison with the state-of-the-art
5. Conclusion
6. Outlook
Acknowledgments
References
1. Introduction
An unprecedented growth in the availability of and access to digital information is taking place
in today’s society, and music is a paradigmatic example. Online digital music collections are in
the order of millions of tracks, and personal collections can easily exceed the practical limits on
the time to listen to them [1]. This huge amount of information readily accessible for end users
poses major challenges for automatically describing, understanding, searching, retrieving, and
organizing musical contents. Music information retrieval (MIR) is the interdisciplinary research
field that deals with these challenges [2].
MIR systems use multiple sources of information: the raw audio signal, symbolic music
representations, audio metadata, tags provided by users or experts, music and social networks
data, etc. In content-based MIR, much effort is focused on extracting information from the raw
audio signal to represent certain musical aspects such as timbre, melody, main tonality, chords,
or tempo [1]. Usually, these features are computed in a short-time moving window either from
a temporal, spectral, or cepstral representation of the audio signal [1], leading to a descriptor
time series reflecting the temporal evolution of a given musical aspect. While common MIR
strategies characterize these time series by means of statistical modeling or machine learning
techniques [3]–[5], raw descriptor time series are used for many tasks such as audio alignment
and matching [6], song structure analysis [7], music similarity [8], audio fingerprinting [9], or
cover song identification [10]–[18].
A cover song is an alternative version, performance, rendition, or recording of a previously
recorded musical piece. While cover songs might differ from their originals in several musical
aspects such as timbre, tempo, song structure, main tonality, arrangement, lyrics, or language
of the vocals, they resemble their originals with regard to other features. A robust so-called
‘mid-level feature’ that is largely preserved under the mentioned musical variations is the
tonal sequence. Tonal sequences can be understood as series of different notes. These notes
can be played alone for each time slot (a melody) or can be played simultaneously with
other notes (chord or harmonic progressions). Methods for automatic cover song identification
usually exploit tonal sequence similarity and attempt to be robust against common changes in
other musical aspects [18]. In general, they either aim to extract the predominant melody, a
chord progression, or a chroma time series (a mid-level feature representing harmonic content)
from the raw audio signal and make it independent of the main tonality. Then, for obtaining
a similarity measure between songs, tonality descriptor time series are usually compared
by means of techniques like dynamic time warping, edit-distance variants, string matching
algorithms, subsequence hashing, or by common similarity functions (for an overview see [18]).
Cover song identification has recently become a very active area of study in the MIR
community [10]–[18]. From a research point of view, cover song identification is a task
where the relation between songs is context-independent and can be quantitatively defined and
objectively measured. It expands the notions of music similarity beyond acoustic resemblance
to include the important idea that musical works retain their identity despite variations in
many musical aspects [20]. From a practical and commercial point of view, quantifying music
similarity is the key to automatically searching and organizing music collections. Furthermore,
identifying cover songs has a direct implication to musical rights management and licenses. In
addition, from a user’s point of view, finding all versions of a particular song can be valuable
and fun.
The MIR evaluation exchange (MIREX) is an international community-based framework
for the formal evaluation of MIR systems and algorithms [21]. Among other tasks, MIREX
allows comparing different algorithms for artist identification, genre classification, or music
transcription (see footnote 2). In particular, MIREX allows for an objective assessment of the accuracy of
different cover song identification algorithms. For that purpose, participants can submit their
algorithms as binary executables, and the MIREX organizers determine and publish the
algorithms’ accuracies and runtimes. The underlying music collections are never published or
disclosed to the participants, either before or after the contest. Therefore, participants cannot
tune their algorithms to the music collections used in the evaluation process.
For the 2007 edition of the MIREX cover song identification contest, our group submitted
an algorithm that we subsequently described in [17]. This algorithm, which used a specifically
designed chroma similarity measure and a subsequence matching method, yielded the highest
accuracy of all algorithms submitted in 2007 and in earlier editions. For the 2008 edition, we
used a qualitatively novel approach. The cover song identification measure that we derived
from this approach (Qmax) and a composition of this measure with a simple post-processing
step (Q∗max) yielded the two highest accuracies of all algorithms submitted in 2008 and in
earlier editions. In particular, the accuracy of both Qmax and Q∗max clearly surpassed our earlier
algorithm proposed in [17].
The Qmax algorithm was submitted to the MIREX contest as a binary executable, and we
here disclose for the first time the underlying procedure. While this algorithm shares MIR
pre-processing steps with [17], the crucial difference is that it involves techniques derived
from nonlinear time series analysis [22]. More specifically, Qmax is a recurrence quantification
analysis (RQA) measure [23]–[26] that is extracted from cross recurrence plots (CRPs) [27],
which are the bivariate generalization of classical recurrence plots (RPs) [28]. The framework
of nonlinear time series analysis offers a variety of techniques to quantify similarities between
dynamics based on signals measured from them. Among these techniques, the CRP seems most
suitable to analyze pairs of musical descriptor time series since it is defined for pairs of signals of
different lengths and can easily cope with variations in the timescale and non-stationarities of the
dynamics [29, 30]. Here, we construct CRPs from delay coordinate state space representations
of multivariate descriptor time series of songs.
2 http://www.music-ir.org/mirexwiki/index.php/Main_Page
CRPs and RQA measures are known as very intuitive and powerful tools in
various disciplines such as astrophysics, earth sciences, engineering, biology, cardiology, or
neuroscience (see [26] and references therein). However, to the best of our knowledge, there are
no previous applications of CRPs and RQA measures to musical signals. In general, only a few
studies apply nonlinear time series analysis to musical signals. In [31, 32], delay coordinates are
applied to raw audio signals with regard to audio analysis and visualization. In [33]–[35], delay
coordinates are applied to musical descriptor time series with regard to genre classification,
user preferences, and timbre modeling. In [36], delay coordinates are applied to human speech
signals for the purpose of local projective noise reduction. Subsequently, in [37], an RQA
measure was defined to automatically adjust the best neighborhood size for this local projection.
It should be noted that RPs and CRPs have certain analogies with commonly used MIR
methods. In particular, the so-called self-similarity matrix was introduced in [38] to visualize
music and audio tracks and later used in [39] for song structure segmentation or in [40] for
identifying components of an audio piece. Currently, self-similarity matrices are commonly
used for diverse tasks such as song structure analysis [7] or musical meter detection [41]. Cross
similarity matrices are used, either directly or indirectly, in audio matching algorithms [6] and
in some cover song identification methods [18]. However, in contrast to CRPs, these similarity
matrices do not apply any delay coordinate state space representation and are, in general, not
thresholded.
A brief overview of the Qmax algorithm and the resulting structure of this paper can be
outlined as follows. Given two songs, we first extract their chroma descriptor time series and
transpose one song to the main tonality of the other (section 2.1). From this pair of multivariate
time series, we form state space representations of the two songs using delay coordinates
involving an embedding dimension m and time delay τ (section 2.2). From this state space
representation, we construct a CRP using a fixed maximum percentage of nearest neighbors
κ (section 2.3). Subsequently, we use Qmax to extract features that are sensitive to cover song
CRP characteristics, which results in two additional parameters γo and γe. In particular, we
derive Qmax from a previously published RQA measure (Lmax, [28]), but adapt it in two steps
(via Smax) to the problem at hand (section 2.4). We evaluate our approach using a large collection
of musical pieces (section 3.1). This music collection was compiled prior to and independently
from the present study and our participation in the MIREX contest. We use a subset of this
music collection and a standard information retrieval (IR) evaluation methodology (section 3.2)
to first perform an in-sample optimization of the parameters m, τ, κ, γo and γe (section 4.1).
We subsequently report the out-of-sample accuracy with optimized parameters of Lmax, Smax,
and Qmax in identifying cover songs (section 4.2). All these steps were carried out before
we submitted the resulting algorithm to the 2008 MIREX cover song identification contest
as a further out-of-sample validation. We review results of this 2008 and the 2007 editions
(section 4.3) before we draw our conclusions (section 5). As an outlook (section 6), we provide
concrete perspectives for future applications of our technique. For this purpose we use coupled
Rössler dynamics with stochastically modulated mean frequencies.
[Figure 1 plot: relative energy (0 to 1) for each pitch class C, C#, D, D#, E, F, F#, G, G#, A, A#, B.]
Figure 1. Example of a PCP feature vector extracted from an audio window
of 464 ms. This PCP corresponds to a C minor chord environment (it mostly
contains C, D# and G pitch classes), where the root pitch class (C) is
predominant.
2. Method
2.1. Pre-processing
The tonal sequence is the most important characteristic shared among covers. To estimate
tonal sequences of musical pieces one can employ chroma or pitch class profile (PCP) features.
These are widely used in the MIR community [42]–[45] and have proven to work well as
primary information for cover song identification systems [18]. For systems employing PCP see
[10, 13, 15, 16, 17, 19].
In general, PCP features are robust against non-tonal components (e.g. ambient noise
or percussive sounds) and independent of timbre and the specific instruments used [45].
Furthermore, they are independent of a musical piece’s loudness and volume fluctuations. PCP
features are derived from the frequency dependent energy in a given range (typically from
50 to 5000 Hz) in short-time spectral representations (e.g. 100 ms) of audio signals computed
in a moving window. This energy is usually mapped into an octave-independent histogram
representing the relative intensity of each of the 12 semitones of the western music chromatic
scale (12 pitch classes). To normalize with respect to loudness, this histogram can be divided by
its maximum value, thus leading to values between 0 and 1 (figure 1).
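The octave folding and maximum normalization described above can be sketched as follows. This is a minimal illustration of a plain PCP, not the HPCP extraction of [45]; the function name `pcp_from_spectrum`, its parameters, and the choice of A at 440 Hz as the reference are assumptions made for the example.

```python
import numpy as np

def pcp_from_spectrum(freqs_hz, energies, f_ref=440.0, f_min=50.0, f_max=5000.0):
    """Fold spectral energy into 12 pitch classes and divide by the maximum.

    Hypothetical helper: bin 0 is C, bin 9 is A (the reference tone f_ref).
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    energies = np.asarray(energies, dtype=float)
    # Keep only the frequency range mentioned in the text (50 to 5000 Hz).
    keep = (freqs_hz >= f_min) & (freqs_hz <= f_max)
    # Distance to the reference tone in semitones, rounded to the nearest one.
    semitones = np.round(12.0 * np.log2(freqs_hz[keep] / f_ref)).astype(int)
    # Octave-independent pitch class index (A is 9 semitones above C).
    pitch_class = (semitones + 9) % 12
    pcp = np.zeros(12)
    np.add.at(pcp, pitch_class, energies[keep])  # accumulate energy per class
    peak = pcp.max()
    return pcp / peak if peak > 0 else pcp

# Toy spectrum resembling a C minor chord: C4, D#4, G4 and C5.
pcp = pcp_from_spectrum([261.63, 311.13, 392.00, 523.25], [1.0, 0.5, 0.6, 0.4])
```

In this toy input the two C components fall into the same bin, so the root pitch class ends up with the largest (unit) value, as in the situation shown in figure 1.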
We here use harmonic PCPs (HPCPs) [45]. These features share the aforementioned PCP
properties, but are based only on the peaks of the spectrum within a certain frequency band,
thereby reducing the influence of noisy spectral components. Furthermore, HPCPs are tuning
independent, so that the reference tone can be different from the standard tone A at 440 Hz. In
addition, they take into account the presence of harmonic frequencies. Apart from using
12 instead of 36 HPCP bins, we use the same HPCP extraction procedure and parameters as
in [17], to which we refer for further details.
The computation of HPCPs in a moving window results in a multidimensional time
series x for each song, expressing its temporal tonal evolution $x = \{x_{h,i}\}$ for $h = 1, \dots, H$ and
$i = 1, \dots, N_x^*$, where $H = 12$ is a routinely employed number of HPCP bins [42]–[45] and $N_x^*$
represents the total number of windows (figure 2). We here use windows of 464 ms with no
overlap between subsequent windows.
The last pre-processing step consists in transposing one HPCP time series to the main
tonality of the other. A change in the main tonality is a common alteration when musicians
New Journal of Physics 11 (2009) 093017 (http://www.njp.org/)
perform cover versions. This is usually done to adapt the original composition to a different
singer or solo instrument, or just for aesthetic reasons. In HPCP representations, a change in
the main tonality is represented by a circular pitch class shift. Accordingly, one can reverse
this change using an appropriate circular shift of the pitch class components along the vertical
axis of an HPCP time series (e.g. to transpose the time series depicted in figure 2 from D to C,
one has to shift the pitch class components circularly up by two bins, i.e. two semitones, for all
windows). To determine the number of bins to transpose, we use the optimal transposition index
procedure proposed in [17] and extended and further evaluated in [46].
[Figure 2 plot: HPCP values (color scale 0.2 to 0.8) per pitch class (C to B) versus time (windows, 50 to 350).]
Figure 2. Example of an HPCP time series extracted using a moving window
from the song Day Tripper as performed by The Beatles.
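A circular pitch class shift of this kind can be sketched in a few lines. The function names, the array layout (pitch classes along rows, windows along columns), and the shift direction are assumptions. The shift estimate below, which compares time-averaged profiles, is only one plausible heuristic given for illustration and should not be read as the exact procedure of [17] or [46].

```python
import numpy as np

def transpose_hpcp(hpcp, n_bins):
    """Circularly shift all windows of an (H, N) HPCP array by n_bins pitch classes."""
    return np.roll(np.asarray(hpcp), shift=n_bins, axis=0)

def estimate_shift(hpcp_x, hpcp_y):
    """Hypothetical heuristic: pick the shift of Y whose time-averaged
    profile has the largest dot product with the averaged profile of X."""
    gx = np.asarray(hpcp_x).mean(axis=1)
    gy = np.asarray(hpcp_y).mean(axis=1)
    scores = [np.dot(gx, np.roll(gy, k)) for k in range(gy.size)]
    return int(np.argmax(scores))

# Toy check: Y is X shifted by two semitones; the estimate undoes the shift.
rng = np.random.default_rng(0)
hx = rng.random((12, 50))
hy = np.roll(hx, -2, axis=0)
hy_transposed = transpose_hpcp(hy, estimate_shift(hx, hy))
```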
2.2. State space embedding
An HPCP time series is a multivariate representation of the temporal tonal evolution of a given
song X . Certainly, it does not represent a signal measured from a stationary dynamical system
which could be described by some equation of motion. Nonetheless, delay coordinates [47], a
tool that is routinely used in nonlinear time series analysis [22], can be pragmatically employed
to facilitate the extraction of information contained in an HPCP time series x (cf [36, 37]). In
particular, by evaluating vectors of sample sequences, delay coordinates allow one to assess
recurrences of the system more reliably than by using only the scalar samples. One should note
that such a use of sequences of notes instead of isolated ones is essential in music [48] and
is important for melody perception and recognition [49].
Considering the temporal evolution of each individual pitch class, we construct a time
series of delay coordinate state space vectors $x = \{x_i\}$ for $i = 1, \dots, N_x$, with $N_x = N_x^* - (m-1)\tau$ and
$$x_i = (x_{1,i}, x_{1,i+\tau}, \dots, x_{1,i+(m-1)\tau}, x_{2,i}, x_{2,i+\tau}, \dots, x_{2,i+(m-1)\tau}, \dots, x_{H,i}, x_{H,i+\tau}, \dots, x_{H,i+(m-1)\tau}), \tag{1}$$
where m is the unitless embedding dimension, and τ is the time delay in units of the number
of windows. For nonlinear time series analysis, an appropriate choice of m and τ is crucial to
extract meaningful information from noisy signals of finite length [22]. While recipes for the
estimation of optimal fixed values of m and τ exist (e.g. the false nearest neighbors method and
the use of the auto-correlation function decay time [22]), we here study cover song identification
accuracy under variation of these parameters and select the best combination (section 4).
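The construction of equation (1) can be illustrated with a short sketch. The function name `delay_embed` and the array layout (an input of shape (H, N*) and an output with one state space vector per row) are assumptions made for the example; indices are zero-based in the code.

```python
import numpy as np

def delay_embed(hpcp, m, tau):
    """Delay coordinate embedding of a multivariate series, following eq. (1).

    hpcp: array of shape (H, N_star). Returns an array of shape (N, H * m)
    with N = N_star - (m - 1) * tau.
    """
    hpcp = np.asarray(hpcp, dtype=float)
    n_bins, n_star = hpcp.shape
    n = n_star - (m - 1) * tau
    if n <= 0:
        raise ValueError("time series too short for this choice of m and tau")
    # For each pitch class h, append x[h, i], x[h, i + tau], ..., x[h, i + (m-1) tau].
    columns = [hpcp[h, k * tau : k * tau + n]
               for h in range(n_bins) for k in range(m)]
    return np.stack(columns, axis=1)

# Toy series with H = 2 and N* = 6, embedded with m = 3 and tau = 2.
toy = np.arange(12).reshape(2, 6)
states = delay_embed(toy, m=3, tau=2)
# states[0] is (0, 2, 4, 6, 8, 10) and states[1] is (1, 3, 5, 7, 9, 11)
```

With H = 12 and, for instance, m = 9 (the value used in figure 3), each state space vector has 108 components.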
2.3. CRP
An RP is a straightforward way to visualize characteristics of similar system states attained at
different times [28]. For this purpose, two discrete time axes span a square matrix which is
filled with zeros and ones, typically visualized as white and black cells, respectively. Each black
cell at coordinates (i, j) indicates a recurrence, i.e. a state at time i which is similar to a state
at time j. Consequently, the main diagonal line is black. CRPs are constructed in the same way as
RPs, but now the two axes span a rectangular, not necessarily square matrix [27]. A CRP allows
one to highlight equivalences of states between two systems attained at different times. When
a CRP is used to characterize distinct systems, the main diagonal is, in general, not black, and
any diagonal path of connected black cells represents similar state sequences exhibited by both
systems [26].
To analyze dependencies between two different signals x and y, here representing two
songs, we compute a CRP R from
$$R_{i,j} = \Theta\left(\varepsilon_i^x - \|x_i - y_j\|\right)\,\Theta\left(\varepsilon_j^y - \|x_i - y_j\|\right) \tag{2}$$
for $i = 1, \dots, N_x$ and $j = 1, \dots, N_y$, where $x_i$ and $y_j$ are state space representations of songs
X and Y at windows i and j, respectively, $\Theta(\cdot)$ is the Heaviside step function ($\Theta(v) = 0$ if
$v < 0$ and $\Theta(v) = 1$ otherwise), $\varepsilon_i^x$ and $\varepsilon_j^y$ are two different threshold distances, and $\|\cdot\|$ is
some norm. We here use the Euclidean norm. Note that by equation (2) $R_{i,j} = 1$ if and only if
$x_i$ is a neighbor of $y_j$ and $y_j$ is a neighbor of $x_i$.
The thresholds $\varepsilon_i^x$ and $\varepsilon_j^y$ are adjusted such that a maximum percentage of neighbors κ
is used for both $x_i$ and $y_j$. In this way, the total number of nonzero entries in each row and
column never exceeds $\kappa N_y$ and $\kappa N_x$, respectively. In line with studies on the identification of
deterministic signals in noisy environments [27], in pre-analysis we found the use of a fixed
percentage of neighbors κ superior to the use of a fixed threshold ε. We study the influence of
the parameter κ in section 4.
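A CRP with neighbor-based thresholds, as in equation (2), might be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the number of allowed neighbors is taken as the integer part of κN with a minimum of one, and ties in distance are not treated specially (with exact ties a row or column could slightly exceed its quota).

```python
import numpy as np

def pairwise_distances(x, y):
    """Euclidean distances between all rows of x (Nx, d) and y (Ny, d)."""
    sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative rounding errors

def cross_recurrence_plot(x, y, kappa):
    """Binary (Nx, Ny) CRP: R[i, j] = 1 iff x_i and y_j are mutual neighbors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = pairwise_distances(x, y)
    nx, ny = dist.shape
    k_row = max(1, int(kappa * ny))  # neighbors allowed for each x_i (assumption)
    k_col = max(1, int(kappa * nx))  # neighbors allowed for each y_j (assumption)
    # Thresholds: distance to the k-th nearest neighbor in the other song.
    eps_x = np.sort(dist, axis=1)[:, k_row - 1]
    eps_y = np.sort(dist, axis=0)[k_col - 1, :]
    # Product of the two Heaviside terms in eq. (2), with Theta(0) = 1.
    return ((dist <= eps_x[:, None]) & (dist <= eps_y[None, :])).astype(int)

# Toy example with one-dimensional states and one allowed neighbor each.
x = np.array([[0.0], [1.0], [10.0]])
y = np.array([[0.0], [1.2], [10.0]])
R = cross_recurrence_plot(x, y, kappa=0.34)
```

In the toy example each state has exactly one mutual nearest neighbor, so R is the 3 × 3 identity matrix.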
In general, pairs of unrelated songs result in CRPs that exhibit no evident structure,
while CRPs constructed for two cover songs show distinct extended patterns (figure 3). These
extended patterns usually correspond to similar sections, phrases, or progressions between both
musical pieces X and Y .
2.4. Recurrence quantification measures for cover song identification
Given a CRP representation of two songs, we require a quantitative criterion to determine
whether they are covers or not. In pre-analysis, we tested different RQA measures [26] as
input for binary classifiers such as trees or support vector machines in combination with several
feature selection algorithms [50] (see footnote 3). This analysis showed that the maximal length of diagonal
lines (Lmax) feature yielded by far the highest discriminative power between CRPs from covers
and non-covers. All other RQA measures that we tried (recurrence rate, determinism, average
diagonal length, entropy, ratio, laminarity, trapping time, maximal length of horizontal or
vertical lines [26], and combinations of them) were found to have no or very low discriminative
power.
3 We use the Weka data mining software: http://www.cs.waikato.ac.nz/ml/weka
[Figure 3 plots: panels (a) and (b), CRPs with time (windows) on both axes.]
Figure 3. CRPs for the song Day Tripper as performed by The Beatles, taken
as song X , versus two different songs, taken as song Y . These are a cover made
by the group Ocean Colour Scene (a) and the song I’ve Got a Crush on You as
performed by Frank Sinatra (b). Parameters are m = 9, τ = 1, and κ = 0.08.
The Lmax measure introduced in [28] can be expressed as the maximum value of a
cumulative matrix L computed from the CRP. We initialize $L_{1,j} = L_{i,1} = 0$ for $i = 1, \dots, N_x$
and $j = 1, \dots, N_y$, and then recursively apply
$$L_{i,j} = \begin{cases} L_{i-1,j-1} + 1, & \text{if } R_{i,j} = 1, \\ 0, & \text{if } R_{i,j} = 0, \end{cases} \tag{3}$$
for $i = 2, \dots, N_x$ and $j = 2, \dots, N_y$, and define $L_{\max} = \max\{L_{i,j}\}$ for $i = 1, \dots, N_x$ and
$j = 1, \dots, N_y$.
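The recursion of equation (3) translates directly into a double loop. The sketch below uses a hypothetical function name `l_max` and zero-based indices; it favors readability over speed.

```python
import numpy as np

def l_max(R):
    """Return (Lmax, L) for a binary CRP R, following eq. (3).

    The first row and column of L stay at zero, as in the initialization
    described in the text.
    """
    R = np.asarray(R)
    nx, ny = R.shape
    L = np.zeros((nx, ny), dtype=int)
    for i in range(1, nx):
        for j in range(1, ny):
            if R[i, j] == 1:
                L[i, j] = L[i - 1, j - 1] + 1  # extend the diagonal line
            # otherwise L[i, j] stays 0: the line is interrupted
    return int(L.max()), L

# Toy CRP with one off-center diagonal of three cells.
R = np.zeros((4, 5), dtype=int)
R[1, 2] = R[2, 3] = R[3, 4] = 1
value, L = l_max(R)  # value is 3
```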
To understand why Lmax performs so well, we depict some example CRPs, where
we use the same song for X and three different songs for Y (figure 4). A high Lmax value
is obtained when X and Y are covers (figure 4(a)), whereas a low value is obtained when
that is not the case (figure 4(c)). An intermediate value is obtained for two songs that share
a common tonal progression, but only for brief periods (figure 4(b)). It turns out that this
particular example of figure 4(b) is a borderline case in which one could argue either way
about whether the two songs are covers. The two songs are very different even in terms of main melody and tonality,
but still they share a very characteristic sample featuring a flute hook that forms the basis of both
songs (see footnote 4).
Diagonal patterns are clearly discernible in figures 4(a) and (b), and the longest of
these diagonals corresponds to the maximum time that X and Y evolve together without
disruptions (i.e. the maximal length of their shared tonal sequence). Note that only in figure 4(a)
the longest diagonal is found close to the main diagonal. However, that is not a necessary
criterion for Y being a cover of X (figure 4(b)). In general, this depends on the musical
structure of the cover song. Often, new performers add, delete, or change the introduction,
solo sections, endings, verses, and so forth. Thus, to account for structure changes, it is
necessary to consider any diagonal regardless of its position in the CRP. This allows one
to detect passages of a song that have been inserted in any part of another song. However,
while Lmax can account for such structural changes, it cannot account for tempo changes.
When covering a musical piece, musicians often adapt the tempo to their needs and, even in
a live performance of the original artist, this feature can change with respect to the original
recording. Tempo deviations between two cover songs result in the curving of CRP diagonal
traces.
4 http://news.bbc.co.uk/2/hi/entertainment/4354028.stm
[Figure 4 plots: panels (a), (b) and (c), CRPs with time (windows) on both axes.]
Figure 4. CRPs for the song Gimme, Gimme, Gimme as performed by the group
ABBA, taken as song X , versus three different songs, taken as song Y . These are
a cover made by the group A-Teens (a), a techno performance of the song Hung
up by Madonna (b), and the song The Robots by Kraftwerk (c). In (a) Lmax = 43
starting at windows (118, 121), in (b) Lmax = 34 starting at windows (176, 130),
and in (c) Lmax = 16 starting at windows (373, 245). Parameters are the same as
in figure 3.
To quantify the length of curved traces we therefore extend equation (3) and compute a
cumulative matrix S from the CRP. We initialize $S_{1,j} = S_{2,j} = S_{i,1} = S_{i,2} = 0$ for $i = 1, \dots, N_x$
and $j = 1, \dots, N_y$, and then recursively apply
$$S_{i,j} = \begin{cases} \max\{S_{i-1,j-1}, S_{i-2,j-1}, S_{i-1,j-2}\} + 1, & \text{if } R_{i,j} = 1, \\ 0, & \text{if } R_{i,j} = 0, \end{cases} \tag{4}$$
for $i = 3, \dots, N_x$ and $j = 3, \dots, N_y$. Here, the maximum value $S_{\max} = \max\{S_{i,j}\}$ for
$i = 1, \dots, N_x$ and $j = 1, \dots, N_y$, corresponds to the length of the longest curved trace in the
CRP. This formulation is inspired by common alignment algorithms [51, 52], but constrains
the possible alignments by excluding horizontal and vertical paths. We should note that these
particular path connections ($S_{i-1,j-1}$, $S_{i-2,j-1}$, $S_{i-1,j-2}$), which are only one aspect of equation
(4), were used before. They were found to work well for speech recognition in application to
distance matrices [53], and for cover song identification in application to the so-called optimal
transposition index-based binary similarity matrices [17].
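Equation (4) differs from equation (3) only in the set of allowed predecessors. A minimal sketch (hypothetical function name `s_max`, zero-based indices, plain loops for readability):

```python
import numpy as np

def s_max(R):
    """Return (Smax, S) for a binary CRP R, following eq. (4).

    The first two rows and columns of S stay at zero.
    """
    R = np.asarray(R)
    nx, ny = R.shape
    S = np.zeros((nx, ny), dtype=int)
    for i in range(2, nx):
        for j in range(2, ny):
            if R[i, j] == 1:
                # A trace may continue diagonally or with a "knight-like"
                # step, which lets it bend under tempo deviations.
                S[i, j] = max(S[i - 1, j - 1], S[i - 2, j - 1], S[i - 1, j - 2]) + 1
    return int(S.max()), S

# Toy CRP with a curved trace: (2,2) -> (3,3) -> (5,4) -> (6,6).
R = np.zeros((7, 7), dtype=int)
for i, j in [(2, 2), (3, 3), (5, 4), (6, 6)]:
    R[i, j] = 1
value, S = s_max(R)  # value is 4
```

In this toy CRP no straight diagonal contains more than two consecutive cells, whereas the curved trace is followed over all four.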
Apart from tempo deviations, musicians might skip some chords or part of the melody
when performing cover songs. This practice leads to short disruptions in otherwise coherent
traces (see, e.g. figure 3(a)). Moreover, such disruptions can also be caused by the fact that
HPCP features might contain some energy not directly associated with tonal content. To account
for disruptions, we therefore extend equation (4) and compute a cumulative matrix Q from
the CRP. We initialize $Q_{1,j} = Q_{2,j} = Q_{i,1} = Q_{i,2} = 0$ for $i = 1, \dots, N_x$ and $j = 1, \dots, N_y$, and
then recursively apply
$$Q_{i,j} = \begin{cases} \max\{Q_{i-1,j-1}, Q_{i-2,j-1}, Q_{i-1,j-2}\} + 1, & \text{if } R_{i,j} = 1, \\ \max\{0,\; Q_{i-1,j-1} - \gamma(R_{i-1,j-1}),\; Q_{i-2,j-1} - \gamma(R_{i-2,j-1}),\; Q_{i-1,j-2} - \gamma(R_{i-1,j-2})\}, & \text{if } R_{i,j} = 0, \end{cases} \tag{5}$$
for $i = 3, \dots, N_x$ and $j = 3, \dots, N_y$, with
$$\gamma(z) = \begin{cases} \gamma_o, & \text{if } z = 1, \\ \gamma_e, & \text{if } z = 0. \end{cases} \tag{6}$$
Hence γo is a penalty for a disruption onset and γe is a penalty for a disruption extension.
The zero inside the second max clause in equation (5) prevents these penalties from
producing negative entries of Q. Note that for γo, γe → ∞, equation (5) becomes equation (4).
For γo = γe = 0, $Q_{i,j}$ becomes a cumulative value indicating global similarity between two
time series starting at sample 0 and ending at samples i and j, respectively. Note that this has
certain analogies with classical dynamic time warping algorithms [51]. Instead of fixing γo and
γe a priori, we study their influence on the accuracy of our cover song identification system
(section 4). Analogously to Lmax and Smax, we take $Q_{\max} = \max\{Q_{i,j}\}$ for $i = 1, \dots, N_x$ and
$j = 1, \dots, N_y$ to quantify the length of the longest curved and potentially disrupted trace in
the CRP.
[Figure 5 plots: panels (a), (b) and (c), each with time (windows) on both axes and a color scale for the matrix values.]
Figure 5. Day Tripper as performed by The Beatles, taken as song X , versus
Ocean Colour Scene performance, taken as song Y . Example plots of L (a),
S (b) and Q (c). Note the increase in the maximum values (colorscales). In
(a) Lmax = 33 starting at windows (140, 232), in (b) Smax = 79 starting at
windows (216, 142), and in (c) Qmax = 136 starting at windows (14, 118). CRP
parameters are the same as in figure 3. Parameters for (c) are γo = 3 and γe = 7.
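The recursion of equations (5) and (6) can be sketched as follows. The function name `q_max` and the zero-based indexing are choices made for this example, and the small penalty values used in the toy call are picked only to keep the arithmetic easy to follow; they are not the values studied in the paper.

```python
import numpy as np

def q_max(R, gamma_o, gamma_e):
    """Return (Qmax, Q) for a binary CRP R, following eqs. (5) and (6)."""
    R = np.asarray(R)
    nx, ny = R.shape
    Q = np.zeros((nx, ny), dtype=float)

    def gamma(z):
        # Onset penalty if the predecessor cell was a recurrence,
        # extension penalty if the disruption was already ongoing.
        return gamma_o if z == 1 else gamma_e

    for i in range(2, nx):
        for j in range(2, ny):
            predecessors = [(i - 1, j - 1), (i - 2, j - 1), (i - 1, j - 2)]
            if R[i, j] == 1:
                Q[i, j] = max(Q[p] for p in predecessors) + 1.0
            else:
                # The leading 0 keeps Q non-negative.
                Q[i, j] = max([0.0] + [Q[p] - gamma(R[p]) for p in predecessors])
    return float(Q.max()), Q

# Toy CRP: a diagonal trace with a one-cell gap at (4, 4).
R = np.zeros((8, 8), dtype=int)
for i in (2, 3, 5, 6):
    R[i, i] = 1
bridged, _ = q_max(R, gamma_o=0.5, gamma_e=0.5)     # 3.5: the gap is bridged
strict, _ = q_max(R, gamma_o=100.0, gamma_e=100.0)  # 2.0: same as eq. (4)
```

With small penalties the trace survives the gap at a cost of one onset penalty, while very large penalties reproduce the behavior of equation (4), in line with the limit noted above.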
For illustration we depict some examples for the three quantification measures discussed in
this section (figure 5). The Lmax measure (figure 5(a)) characterizes straight diagonals regardless
of their position. The Smax measure can account for tempo fluctuations resulting in curved traces
(figure 5(b)). Furthermore, the Qmax measure allows for disruptions of the tonal progression
(figure 5(c)).
[Figure 6 plots: (a) number of cover sets versus cover set cardinality; (b) number of songs versus genre (PR, E, JB, WM, C, M).]
Figure 6. Distribution of the cover set cardinality (a) and the distribution of
genres across all songs (b). PR stands for pop-rock, E for electronic, JB for
jazz-blues, WM for world music, C for classical and M for miscellaneous.
3. Evaluation
3.1. Evaluation data
To test the effectiveness of the implemented approaches, we analyze a music collection
comprising a total of 1953 commercial songs with an average song length of 3.5 min, ranging
from 0.5 to 7 min. These songs include 500 cover sets, where cover set refers to a group of
versions of the same song. The average cardinality of these cover sets (i.e. the number of songs
per cover set) is 3.9, ranging from 2 to 18 (figure 6(a)). In composing this music collection
we aimed at including a variety of styles and genres (figure 6(b)). No further criterion for
the inclusion or exclusion of songs was applied. A complete list of the music collection can
be found at http://mtg.upf.edu/people/jserra/. This music collection was compiled prior to and
independently from the present study.
To form one training and three testing music collections, we split the total number of
500 cover sets into three non-overlapping subsets. The training collection contains 90 songs
consisting of 15 cover sets of cardinality 6. The first testing collection contains 330 songs
divided into 30 cover sets of cardinality 11. The second testing collection contains the remaining
455 cover sets, each having cardinalities between 2 and 18, resulting in a total of 1533 songs.
A further testing collection is defined as the union of the first and second testing collections.
3.2. Evaluation methodology
Given a music collection with D songs, we calculate Lmax, Smax and Qmax for all D(D − 1)/2
possible pairwise combinations. Once such a similarity matrix is computed as the primary source
of information, we can resort to standard IR measures to evaluate the discriminative power of
this information. We use the mean of average precision measure [54], which we denote as Ψ. To
calculate this measure, the similarity matrix is used to compute a list Λq of D − 1 songs sorted
in descending order with regard to their similarity to song q. Suppose that the query song q
belongs to a cover set comprising Cq + 1 songs. Then, the average precision ψq is obtained as
\[
\psi_q = \frac{1}{C_q} \sum_{r=1}^{D-1} P_q(r)\, I_q(r), \tag{7}
\]
where Pq(r) is the precision of the sorted list Λq at rank r,
\[
P_q(r) = \frac{1}{r} \sum_{l=1}^{r} I_q(l), \tag{8}
\]
and Iq(·) is the so-called relevance function (Iq(u) = 1 if the song with rank u in the sorted list
is a cover of q, and Iq(u) = 0 otherwise). Hence ψq ranges between 0 and 1. If the cover songs
take the first Cq ranks, we obtain ψq = 1. If all cover songs are found towards the end of Λq,
we obtain values close to 0. The Ψ measure is calculated as the mean of average precisions ψq
across all queries q. This evaluation measure is routinely employed in a wide variety of tasks
in the IR [54] and MIR communities, including the MIREX cover song identification task [20].
Using equations (7) and (8) has the advantage of taking into account the whole sorted list, where
correct items with low rank (i.e. near the top of the list) receive the largest weights.
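Equations (7) and (8) translate directly into code. The following sketch is only an illustration under stated assumptions: the function names, the representation of cover sets as integer labels, and the decision to skip queries without any cover (Cq = 0) are choices made here, not details taken from the text.

```python
import numpy as np


def average_precision(similarities, is_cover):
    """Average precision psi_q of equations (7) and (8) for one query.

    similarities: similarity of the query to each of the other D - 1 songs.
    is_cover: boolean array, True where the corresponding song is a cover.
    """
    # Sorted list in descending order of similarity.
    order = np.argsort(-np.asarray(similarities, dtype=float), kind="stable")
    relevance = np.asarray(is_cover, dtype=float)[order]      # I_q(r)
    ranks = np.arange(1, relevance.size + 1)
    precision = np.cumsum(relevance) / ranks                  # P_q(r), eq. (8)
    return float(np.sum(precision * relevance) / relevance.sum())  # eq. (7)


def mean_average_precision(S, cover_set_ids):
    """Mean of psi_q over all queries that have at least one cover."""
    S = np.asarray(S, dtype=float)
    ids = np.asarray(cover_set_ids)
    scores = []
    for q in range(len(ids)):
        others = np.arange(len(ids)) != q      # exclude the query itself
        is_cover = ids[others] == ids[q]
        if is_cover.any():                     # assumption: skip C_q = 0
            scores.append(average_precision(S[q, others], is_cover))
    return float(np.mean(scores))


# Hypothetical toy collection: two cover sets of two songs each.
S = np.array([
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.3, 0.1],
    [0.1, 0.3, 0.0, 0.8],
    [0.2, 0.1, 0.8, 0.0],
])
print(mean_average_precision(S, [0, 0, 1, 1]))
```

In this toy example every cover is ranked first for its query, so each ψq attains its maximum and so does their mean.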
Additionally, we estimate the accuracy level expected under the null hypothesis that
the similarity matrix has no discriminative power with regard to the assignment of cover
sets. For this purpose, we randomly permute Λq separately for each q, leaving all other steps
unchanged. We repeat this process 19 times, corresponding to a significance level of 0.05 of this
Monte Carlo null hypothesis test, and take the average, resulting in Ψnull. This Ψnull can be
used to estimate the accuracy of all measures Lmax, Smax and Qmax under the specified null
hypothesis.
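The randomization described above can be sketched as follows. With 19 randomizations, an observed value that exceeds all of them corresponds to a one-sided rank-order test at level 1/(19 + 1) = 0.05. The function names, the integer cover-set labels and the fixed random seed are assumptions of this sketch.

```python
import numpy as np


def average_precision_from_relevance(relevance):
    """psi_q (equations (7), (8)) from the relevance values in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    precision = np.cumsum(relevance) / np.arange(1, relevance.size + 1)
    return float(np.sum(precision * relevance) / relevance.sum())


def psi_null(cover_set_ids, n_randomizations=19, seed=0):
    """Mean average precision obtained from randomly permuted sorted lists.

    Returns the average over all randomizations and the individual values.
    """
    rng = np.random.default_rng(seed)
    ids = np.asarray(cover_set_ids)
    values = []
    for _ in range(n_randomizations):
        scores = []
        for q in range(len(ids)):
            relevance = np.delete(ids == ids[q], q)   # drop the query itself
            if relevance.any():
                # Permuting the list removes any ranking information.
                shuffled = rng.permutation(relevance)
                scores.append(average_precision_from_relevance(shuffled))
        values.append(np.mean(scores))
    return float(np.mean(values)), np.array(values)


# Labels mimicking the training collection: 15 cover sets of cardinality 6.
ids = np.repeat(np.arange(15), 6)
mean_null, values = psi_null(ids)
print(mean_null, values.min(), values.max())
```

The minimum and maximum over the individual values give the kind of range that can be drawn as an error margin around the null level.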
4. Results
4.1. Parameter optimization
We use the training collection to study the influence of the embedding parameters m and τ and
the percentage of nearest neighbors κ on our accuracy measure Ψ. Results for Qmax (figure 7)
illustrate that the use of an embedding (m > 1) improves the accuracy of the algorithm as
compared to no embedding (m = 1). A broad peak of near-maximal Ψ values is established for
a considerable range of embedding windows (approximately 7 < (m − 1)τ < 17). From these
near-maximal values, Ψ decreases weakly as the embedding window is increased further.
Optimal κ values are found between 0.05 and 0.15. Therefore, within these broad ranges of the
embedding window (m − 1)τ and κ values, no fine tuning of any of the parameters is required
to yield near-optimal accuracy. In the following we use m = 10, τ = 1 and κ = 0.1.
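To make the roles of m, τ and κ concrete, the following sketch builds a delay-coordinate embedding of a multivariate descriptor series and a binary CRP from it. It is only a rough illustration: the function names, the order in which delayed copies are concatenated, and the use of per-row and per-column distance quantiles to select a fraction κ of nearest neighbors are assumptions of this sketch and may differ from the construction defined earlier in the paper.

```python
import numpy as np


def delay_embed(series, m=10, tau=1):
    """Delay-coordinate embedding of an (N, H) series.

    Returns an array of shape (N - (m - 1) * tau, m * H), where each row
    concatenates m copies of the descriptor vector spaced tau samples apart.
    """
    series = np.asarray(series, dtype=float)
    n = series.shape[0] - (m - 1) * tau   # number of embedded state vectors
    if n <= 0:
        raise ValueError("series too short for this embedding window")
    return np.hstack([series[k * tau : k * tau + n] for k in range(m)])


def cross_recurrence(x, y, kappa=0.1):
    """Binary CRP: R[i, j] = 1 if x[i] and y[j] are mutual near neighbors.

    Assumption: thresholds are the kappa-quantiles of the Euclidean
    distances in each row (for x[i]) and each column (for y[j]).
    """
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    eps_x = np.quantile(d, kappa, axis=1, keepdims=True)   # shape (Nx, 1)
    eps_y = np.quantile(d, kappa, axis=0, keepdims=True)   # shape (1, Ny)
    return ((d <= eps_x) & (d <= eps_y)).astype(int)


# Random data standing in for two 12-dimensional descriptor time series.
rng = np.random.default_rng(0)
x = delay_embed(rng.random((80, 12)), m=10, tau=1)
y = delay_embed(rng.random((70, 12)), m=10, tau=1)
R = cross_recurrence(x, y, kappa=0.1)
print(x.shape, y.shape, R.shape, R.mean())
```

The embedding window (m − 1)τ determines how many samples are lost at the end of each series, and κ controls how densely the resulting plot is populated.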
While the accuracies shown in figure 7 are computed with a disruption onset penalty γo = 2 and
a disruption extension penalty γe = 2, the influence of these penalty parameters is further studied
in figure 8. Recall that γo and γe are introduced only in the definition of Qmax and that for
γo, γe → ∞, the measure Qmax (equation (5)) reduces to Smax (equation (4)). Using finite values
of these terms generally increases the accuracy, revealing the advantage of Qmax over Smax.
Optimal Qmax accuracy values are found for γo = 5 and γe = 0.5.
The same parameter optimization described above for Qmax was carried out separately
for Lmax and Smax, and m = 10, τ = 1 and κ = 0.1 led to near-optimal accuracies also for
these measures. Furthermore, no fine tuning was required since iso-τ and iso-m curves for
different κ values have similar shapes as the ones depicted for Qmax in figure 7. For the training
collection, this in-sample parameter optimization leads to the following accuracies (figure 9(a)):
Ψ_{Lmax} = 0.640, Ψ_{Smax} = 0.728 and Ψ_{Qmax} = 0.813.
[Figure 9 panels (a)–(d): bar charts of Ψ (vertical axis from 0 to 1) for Null, Lmax, Smax and Qmax.]
Figure 9. Mean average precision Ψ for the training (a) and the three testing
collections (b)–(d). Error margins in the leftmost bars correspond to the range
across the 19 randomizations described in section 3.2.
accuracies indicate that our results cannot be explained by a parameter over-optimization. The
accuracy increase gained through the derivation from Lmax via Smax to Qmax is substantial. Most
importantly, this increase in accuracy is reflected in the testing collections as well. Moreover, all
values for Lmax, Smax and Qmax are significantly outside the range of Ψnull across the 19 Monte
Carlo randomizations. Therefore, our accuracy values are not consistent with the null hypothesis
that the similarity matrices have no discriminative power.
4.3. Comparison with the state-of-the-art
As stated in the introduction, the algorithm proposed in [17] as well as two algorithms based
on Qmax were submitted to the MIREX contest in 2007 and 2008, respectively. The MIREX
test collection is composed of 30 cover sets of cardinality 11 each [20]. Accordingly, the total
cover song collection contains 330 songs. Another 670 individual songs, i.e. cover sets of
cardinality 1, are added to make the identification task more difficult. The entire music collection
includes a wide diversity of genres (e.g. pop, rock, classical, baroque, folk, jazz, etc), and the
variations span a variety of styles and orchestrations. Beyond this general description, no further
information about the test collection is published or disclosed to the participants. In particular,
only the MIREX organizers know what actual musical pieces are contained in the test collection.
Each of the 330 cover songs was used as a query, and the submitted algorithms were required to
return a 330 × 1000 distance matrix (one row for each query; see footnote 5). From this distance matrix,
several evaluation measures were computed by the MIREX organizers. In 2007 and 2008 the
same evaluation measures were applied, including Ψ as the main reference.
The algorithm in [17] was found to be the most accurate one in the 2007 edition (footnote 6;
figure 10(a), Ψ_{[17]} = 0.521). The two most accurate algorithms in 2008 were based on
Qmax. The raw Qmax algorithm as presented here reached an accuracy of Ψ_{Qmax} = 0.661
(footnote 7; figure 10(b)). It was only outperformed by an algorithm which included Qmax as described
here, plus one additional simple post-processing step applied to the similarity matrix derived
from Qmax (Ψ_{Q*max} = 0.750). This post-processing step was proposed by our group and consists
of detecting cover song sets instead of isolated songs [55]. More concretely, it applies an
unsupervised community detection algorithm operating on a complex network computed from
5 http://www.music-ir.org/mirex/2008/index.php/Audio_Cover_Song_Identification
6 http://www.music-ir.org/mirex/2007/index.php/Audio_Cover_Song_Identification_Results
7 http://www.music-ir.org/mirex/2008/index.php/Audio_Cover_Song_Identification_Results
[Figure 10 panels: (a) Ψ for algorithms A1–A7 and [17]; (b) Ψ for algorithms B1–B6, Qmax and Q*max. Vertical axes range from 0 to 1.]
Figure 10. Mean average precision Ψ for algorithms submitted to the MIREX
2007 (a) and 2008 (b) contests. A1–A7 and B1–B6 refer to algorithms submitted
by other participants and [17] refers to our previous work.
the pairwise Qmax matrix used in section 3.2 and normalizes this matrix according to the detected
communities.
Most importantly, the Ψ_{Qmax} value obtained for the MIREX music collection is very close to
the Ψ_{Qmax} values reported for the testing collections used here (figures 9 and 10). This provides
evidence that the out-of-sample accuracy values reported in section 4.2 are not related to any
hidden in-sample optimization which could have been introduced involuntarily, for example, by
a biased selection of songs for the testing collections.
5. Conclusion
In the present work, we combine concepts from music signal processing, nonlinear time
series analysis, machine learning and IR to successfully identify covers of musical pieces.
The composition of concepts from these different disciplines naturally results in a modular
organization of our method. Given two audio signals, we first use techniques from music
signal processing to extract descriptor time series representing their tonal progression. These
time series are then used for multivariate embedding by means of delay coordinates. To assess
equivalences of states between both systems attained at different times, we use CRPs and
recurrence quantification measures derived from them. In a pre-analysis, existing recurrence
quantification measures were evaluated using machine learning techniques. The obtained result
motivated us to introduce the new cross recurrence quantification measures Smax and Qmax. Using
standard IR evaluation measures we quantify the accuracy for the task at hand.
We here show that our algorithm leads to high accuracy for the cover song identification
task on a comprehensive music collection compiled prior to and independently from the present
study. This music collection is divided into non-overlapping testing and training collections.
We adjust the parameters on the training collection and then determine the accuracy out-
of-sample using different testing collections. Nonetheless, in such a study design, one could
still overestimate the true accuracy of the algorithm by involuntarily introducing biases in the
compilation of the music collection. However, the close match of the accuracy reported here
for our music collection and the one obtained for the MIREX contest supports the generality
of the reported results (recall that the music collection used here was compiled prior to
and independently from our participation in the MIREX contest). Furthermore, the proposed
algorithm reached the highest accuracies obtained so far in the MIREX cover song identification task. This
illustrates its superiority with respect to current state-of-the-art algorithms, including our previous
approach [17].
One should note that the concept of delay coordinates was originally developed for
the reconstruction of stationary deterministic dynamical systems from single variables measured
from them [22]. Also, the identification of coherent traces within the CRP is connected to the
notion of deterministic dynamics (see [26] and references therein). Certainly, musical pieces
do not represent the output of a stationary deterministic dynamical system, and therefore,
one could argue that applying concepts developed for deterministic systems to such signals
is inappropriate. However, if we consider a song as the output of some ‘complicated system’
evolving with time, and an HPCP as a multivariate time series measured from it, we can use
the method of delay coordinates to facilitate the extraction of the information characterizing the
underlying system. In fact, we find that the accuracy of our cover song identification system
is significantly improved using an embedding, compared to not using it. In conclusion, our
work provides a further example for an application of nonlinear time series analysis methods to
experimental time series where the assumption of some underlying deterministic dynamics is
not fulfilled in a strict sense, but which nonetheless allows one to successfully characterize the
system underlying the time series.
6. Outlook
In closing, we provide evidence that the Qmax measure proposed here is not restricted to MIR
nor to the particular application of cover song identification. Indeed, a quantitative assessment
of curved and disrupted traces in RPs and CRPs can be useful for the characterization of a
variety of experimental and artificial signals.
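As a concrete example for a physical setting, we study two Rössler dynamics
unidirectionally coupled by a diffusive term of strength ε:
\[
\begin{aligned}
\dot{x}_1(t) &= -\omega_x(t)\, x_2(t) - x_3(t), \\
\dot{x}_2(t) &= \omega_x(t)\, x_1(t) + 0.15\, x_2(t), \\
\dot{x}_3(t) &= [x_1(t) - 10]\, x_3(t) + 0.2, \\
\dot{y}_1(t) &= -\omega_y\, y_2(t) - y_3(t) + \varepsilon\, [x_1(t) - y_1(t)], \\
\dot{y}_2(t) &= \omega_y\, y_1(t) + 0.15\, y_2(t), \\
\dot{y}_3(t) &= [y_1(t) - 10]\, y_3(t) + 0.2.
\end{aligned}
\tag{9}
\]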
For our context, the key feature of this example is that the mean frequency of the driving
dynamics ωx(t) is varied while ωy = 1 is time-independent. We integrate equation (9) using
a fourth-order Runge–Kutta algorithm with a fixed step size of Δt = 0.05 time units and vary
ωx(t) according to
\[
\omega_x(j \Delta t) = 1 + 0.02\, \xi_j, \tag{10}
\]
where ξj is a strongly correlated first-order autoregressive process
\[
\xi_j = 0.98\, \xi_{j-1} + \eta_j \tag{11}
\]
with j being an integer and ηj corresponding to uncorrelated Gaussian noise with zero mean
and unit variance. Note that ξj has zero mean and a variance of approximately 24. We start
[Figure 11 panels (a) and (b): CRP regions; both axes show time in units of 6Δt, from 325 to 625.]
Figure 11. Exemplary CRP regions for x1(ti) and y1(ti) obtained from one
realization of the coupled Rössler dynamics (a) and the uncoupled ones (b).
the integration of equation (9) at random initial conditions and use a sufficient number of
pre-iterations to reduce transients. Time series pairs x1(ti) and y1(ti) are then sampled at
ti − ti−1 = 6Δt for a time series length of N*x = N*y = 2048 (i = 1, ..., 2048).
We compare results for coupled dynamics (equation (9) with ε = 0.4) versus uncoupled
dynamics (equation (9) with ε = 0). For both conditions, we generate a set of 2000 independent
realizations for each time series x1(ti) and y1(ti). We construct the CRP and extract Lmax, Smax
and Qmax from all time series pairs. We here use as parameters: m = 8, τ = 1, κ = 0.0125 and
γo = γe = 1. None of these parameters, nor the parameters of equations (9)–(11), are optimized
in any way for the example presented here.
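One realization of the coupled dynamics in equations (9)–(11) can be generated along the following lines. This is a sketch under stated assumptions rather than the code used for the results: ωx is held constant within each Runge–Kutta step, the autoregressive process starts at zero, initial conditions are drawn uniformly from [0, 1), and the number of discarded pre-iterations `n_pre` is an arbitrary choice.

```python
import numpy as np


def derivative(state, omega_x, omega_y, eps):
    """Right-hand side of equation (9)."""
    x1, x2, x3, y1, y2, y3 = state
    return np.array([
        -omega_x * x2 - x3,
        omega_x * x1 + 0.15 * x2,
        (x1 - 10.0) * x3 + 0.2,
        -omega_y * y2 - y3 + eps * (x1 - y1),
        omega_y * y1 + 0.15 * y2,
        (y1 - 10.0) * y3 + 0.2,
    ])


def simulate(eps, n_samples=2048, dt=0.05, stride=6, n_pre=20000, seed=0):
    """Return the sampled series x1(t_i) and y1(t_i), one value every 6 steps."""
    rng = np.random.default_rng(seed)
    state = rng.random(6)          # assumption: initial conditions in [0, 1)
    xi = 0.0                       # assumption: AR(1) process starts at zero
    x1, y1 = [], []
    for step in range(n_pre + n_samples * stride):
        xi = 0.98 * xi + rng.standard_normal()   # equation (11)
        omega_x = 1.0 + 0.02 * xi                # equation (10)
        # Classical fourth-order Runge-Kutta step with omega_x held fixed.
        k1 = derivative(state, omega_x, 1.0, eps)
        k2 = derivative(state + 0.5 * dt * k1, omega_x, 1.0, eps)
        k3 = derivative(state + 0.5 * dt * k2, omega_x, 1.0, eps)
        k4 = derivative(state + dt * k3, omega_x, 1.0, eps)
        state = state + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        # Keep one sample every `stride` steps once the transient is discarded.
        if step >= n_pre and (step - n_pre) % stride == 0:
            x1.append(state[0])
            y1.append(state[3])
    return np.array(x1), np.array(y1)


# Short demonstration run; the settings in the text use 2048 samples.
x1_c, y1_c = simulate(eps=0.4, n_samples=256, n_pre=2000)   # coupled
x1_u, y1_u = simulate(eps=0.0, n_samples=256, n_pre=2000)   # uncoupled
print(x1_c.shape, y1_c.shape, x1_u.shape, y1_u.shape)
```

Repeating such runs with different seeds would give independent realizations for the coupled (ε = 0.4) and uncoupled (ε = 0) conditions.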
Regarding the CRP constructed from realizations of x1(t) and y1(t) for the coupled
dynamics, we find curved and briefly disrupted traces along the main diagonal (figure 11(a)).
These reflect the strong coupling, and their interruptions and curvatures are caused by the
stochastically varying mean frequency of the driving Rössler oscillator. In contrast, only
dispersed patterns are observed for the uncoupled dynamics (figure 11(b)). In consequence,
across all realizations, the distributions of Qmax values obtained for the coupled versus
uncoupled condition are almost non-overlapping (figure 12(c)). Distributions of Lmax and Smax
in contrast overlap substantially (figures 12(a) and (b), respectively). Hence, only Qmax allows
one to distinguish between these two conditions.
This example of coupled Rössler dynamics with stochastically varying mean frequencies
is meant to sketch only one potential application of Qmax. A systematic study of this setting and
the influence of the various parameters is left for future work. Results of such a study can have
important implications for the analysis of interactions between brain oscillations and tremors
in Parkinson patients or between cardiac and respiratory dynamics. This holds since these
pathological and physiological processes are known to be characterized by mean frequencies
with irregular time-dependencies.
Furthermore, one should note that curved structures have been reported in RPs and CRPs
of artificial and experimental signals. Artificial signals include frequency modulated periodic
signals [29, 30, 56] or time series derived from Rössler dynamics with bidirectional couplings
close to the onset of phase synchronization [56]. Experimental data include signals with
[Figure 12 panels: histograms of (a) Lmax, (b) Smax and (c) Qmax; vertical axes show frequency.]
Figure 12. Histograms for Lmax (a), Smax (b) and Qmax (c) obtained for the 2000
independent realizations of the coupled (red) and uncoupled (blue) dynamics.
nonlinearly re-scaled or distorted time axes such as geophysical data of sediment cores subjected
to different compressions [29], symbolic dynamic representations of EEG recordings from the
onsets of epileptic seizures [56], or acoustic signals from calls of primates [30]. Far beyond
these particular examples, it can be conjectured that important features of further experimental
signals, e.g. from bioinformatics [57], physiology [24], human speech processing [51], or
climatology [58], are reflected in curved and disrupted traces in RPs and CRPs. A quantitative
assessment of these traces by means of the proposed measures can thus help to characterize a
multitude of systems from different scientific disciplines.
Acknowledgments
We thank D Chicharro, E Gómez and P Herrera for useful discussions and M Koppenberger for
technical support. This research has been partially funded by the EU-IP project PHAROS
IST-2006-045035 and by the BFU2007-61710 Spanish Ministry of Education and Science grant.
References
[1] Casey M, Veltkamp R C, Goto M, Leman M, Rhodes C and Slaney M 2008 Content-based music information
retrieval: current directions and future challenges Proc. IEEE 96 668–96
[2] Orio N 2006 Music retrieval: a tutorial and review Found. Trends Inf. Retrieval 1 1–90
[3] Tzanetakis G and Cook P 2002 Musical genre classification of audio signals IEEE Trans. Speech Audio
Process. 5 293–302
[4] Aucouturier J J and Pachet F 2004 Improving timbre similarity: how high is the sky? J. Negative Results
Speech Audio Sci. 1 1
[5] Bergstra J, Casagrande N, Erhan D, Eck D and Kégl B 2006 Aggregate features and adaboost for music
classification Mach. Learn. J. 65 473–84
[6] Müller M 2007 Information Retrieval for Music and Motion (Berlin: Springer)
[7] Ong B S 2007 Structural analysis and segmentation of music signals PhD Thesis Universitat Pompeu Fabra,
Barcelona, Spain Available online: http://mtg.upf.edu/node/508
[8] Casey M, Rhodes C and Slaney M 2008 Analysis of minimum distances in high-dimensional musical spaces
IEEE Trans. Audio Speech Lang. Process. 16 1015–28
[9] Cano P, Batlle E, Kalker T and Haitsma J 2005 A review of audio fingerprinting J. VLSI Signal Process.
41 271–84
[10] Nagano H, Kashino K and Murase H 2002 Fast music retrieval using polyphonic binary feature vectors IEEE
Int. Conf. on Multimedia and Expo (ICME) vol 1, pp 101–4
[11] Tsai W H, Yu H M and Wang H M 2008 Using the similarity of main melodies to identify cover versions of
popular songs for music document retrieval J. Inf. Sci. Eng. 24 1669–87
[12] Izmirli Ö 2005 Tonal similarity from audio using a template based attractor model Int. Symp. on Music
Information Retrieval (ISMIR) pp 540–5
[13] Gómez E, Ong B S and Herrera P 2006 Automatic tonal analysis from music summaries for version
identification Conv. of the Audio Engineering Society (AES) CD-ROM, paper no. 6902
[14] Marolt M 2008 A mid-level representation for melody-based retrieval in audio collections IEEE Trans.
Multimedia 10 1617–25
[15] Ellis D P W and Poliner G E 2007 Identifying cover songs with chroma features and dynamic programming
beat tracking IEEE Int. Conf. on Acoustics Speech and Signal Processing (ICASSP) vol 4, pp 1429–32
[16] Bello J P 2007 Audio-based cover song retrieval using approximate chord sequences: testing shifts, gaps,
swaps and beats Int. Symp. on Music Information Retrieval (ISMIR) pp 239–44
[17] Serrà J, Gómez E, Herrera P and Serra X 2008 Chroma binary similarity and local alignment applied to cover
song identification IEEE Trans. Audio Speech Lang. Process. 16 1138–52
[18] Serrà J, Gómez E and Herrera P 2009 Audio Cover Song Identification and Similarity: Background,
Approaches, Evaluation, and Beyond (Berlin: Springer) at press
[19] Jensen J H, Christensen M G, Ellis D P W and Jensen S H 2008 A tempo-insensitive distance measure
for cover song identification based on chroma features IEEE Int. Conf. on Acoustics, Speech, and Signal
Processing (ICASSP) pp 2209–12
[20] Downie J S, Bay M, Ehmann A F and Jones M C 2008 Audio cover song identification: MIREX 2006–2007
results and analyses Int. Symp. on Music Information Retrieval (ISMIR) pp 468–73
[21] Downie J S 2008 The music information retrieval evaluation exchange (2005–2007): a window into music
information retrieval research Acoust. Sci. Technol. 29 247–55
[22] Kantz H and Schreiber T 2004 Nonlinear Time Series Analysis 2nd edn (Cambridge: Cambridge University
Press)
[23] Zbilut J P and Webber C L Jr 1992 Embeddings and delays as derived from quantification of recurrence plots
Phys. Lett. A 171 199–203
[24] Webber C L Jr and Zbilut J P 1994 Dynamical assessment of physiological systems and states using recurrence
plot strategies J. Appl. Physiol. 76 965–73
[25] Marwan N, Wessel N, Meyerfeldt U, Schirdewan A and Kurths J 2002 Recurrence-plot-based measures of
complexity and its application to heart rate variability data Phys. Rev. E 66 026702
[26] Marwan N, Romano M C, Thiel M and Kurths J 2007 Recurrence plots for the analysis of complex systems
Phys. Rep. 438 237–329
[27] Zbilut J P, Giuliani A and Webber C L Jr 1998 Detecting deterministic signals in exceptionally noisy
environments using cross-recurrence quantification Phys. Lett. A 246 122–8
[28] Eckmann J P, Kamphorst S O and Ruelle D 1987 Recurrence plots of dynamical systems Europhys. Lett.
5 973–7
[29] Marwan N, Thiel M and Nowaczyk N R 2002 Cross recurrence plot based synchronization of time series
Nonlinear Process. Geophys. 9 325–31
[30] Facchini A, Kantz H and Tiezzi E 2005 Recurrence plot analysis of nonstationary data: the understanding of
curved patterns Phys. Rev. E 72 021915
[31] Gerhard D 1999 Audio visualization in phase space Bridges: Mathematical Connections in Art, Music, and
Science pp 137–44
[32] Reiss J D and Sandler M B 2003 Nonlinear time series analysis of musical signals Int. Conf. on Digital
Audio Effects (DAFx) pp 1–5
[33] Mierswa I and Morik K 2005 Automatic feature extraction for classifying audio data Mach. Learn. J.
58 127–49
[34] Mörchen F, Ultsch A, Thies M and Löhken I 2006 Modelling timbre distance with temporal statistics from
polyphonic music IEEE Trans. Speech Audio Process. 14 81–90
[35] Mörchen F, Mierswa I and Ultsch A 2006 Understandable models of music collections based on exhaustive
feature generation with temporal statistics ACM SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining pp 882–91
[36] Hegger R, Kantz H and Matassini L 2000 Denoising human speech signals using chaoslike features Phys.
Rev. Lett. 84 3197–200
[37] Matassini L, Kantz H, Holyst J and Hegger R 2002 Optimizing of recurrence plots for noise reduction Phys.
Rev. E 65 021102
[38] Foote J 1999 Visualizing music and audio using self-similarity ACM Int. Conf. on Multimedia pp 77–80
[39] Foote J 2000 Automatic audio segmentation using a measure of audio novelty IEEE Int. Conf. on Multimedia
and Expo (ICME) vol 1, pp 452–5
[40] Casey M and Westner W 2000 Separation of mixed audio sources by independent subspace analysis Int.
Computer Music Conf. (ICMC) pp 154–61
[41] Gainza M 2009 Automatic musical meter detection IEEE Int. Conf. on Acoustics, Speech, and Signal
Processing (ICASSP) pp 329–32
[42] Fujishima T 1999 Realtime chord recognition of musical sound: a system using common lisp music Int.
Computer Music Conf. (ICMC) pp 464–7
[43] Sheh A and Ellis D P W 2003 Chord segmentation and recognition using EM-trained hidden Markov models
Int. Symp. on Music Information Retrieval (ISMIR) pp 183–9
[44] Pauws S 2004 Musical key extraction from audio Int. Symp. on Music Information Retrieval (ISMIR) pp 96–9
[45] Gómez E 2006 Tonal description of music audio signals PhD Thesis Universitat Pompeu Fabra, Barcelona,
Spain Available online: http://mtg.upf.edu/node/472
[46] Serrà J, Gómez E and Herrera P 2008 Transposing chroma representations to a common key IEEE CS Conf.
on The Use of Symbols to Represent Music and Multimedia Objects pp 45–8
[47] Takens F 1981 Detecting strange attractors in turbulence Lect. Notes Math. 898 366–81
[48] Huron D 2006 Sweet Anticipation: Music and the Psychology of Expectation (Cambridge: MIT Press)
[49] Schulkind M D, Posner R J and Rubin D C 2003 Musical features that facilitate melody identification: how
do you know it’s your song when they finally play it? Music Percep. 21 217–49
[50] Witten I H and Frank E 2005 Data Mining: Practical Machine Learning Tools and Techniques 2nd edn
(Amsterdam: Elsevier)
[51] Rabiner L R and Juang B H 1993 Fundamentals of Speech Recognition (New York: Prentice-Hall)
[52] Gusfield D 1997 Algorithms on Strings, Trees and Sequences: Computer Sciences and Computational Biology
(Cambridge: Cambridge University Press)
[53] Myers C, Rabiner L R and Rosenberg A E 1980 Performance tradeoffs in dynamic time warping algorithms
for isolated word recognition IEEE Trans. Audio Speech Lang. Process. 28 623–35
[54] Manning C D, Prabhakar R and Schutze H 2008 An Introduction to Information Retrieval (Cambridge:
Cambridge University Press) Available online: http://www.informationretrieval.org
[55] Serrà J, Zanin M, Laurier C and Sordo M 2009 Unsupervised detection of cover song sets: accuracy increase
and original detection Conf. of the Int. Society for Music Information Research (ISMIR) at press
[56] Groth A 2005 Visualization of coupling in time series by order recurrence plots Phys. Rev. E 72 046220
[57] Aach J and Church G 2001 Aligning gene expression time series with time warping algorithms Bioinformatics
17 495–508
[58] Marwan N and Kurths J 2002 Nonlinear analysis of bivariate data with cross recurrence plots Phys. Lett. A
302 299–307