ISMIR 2008 – Session 2b – Music Recognition and Visualization

A ROBOT SINGER WITH MUSIC RECOGNITION BASED ON REAL-TIME BEAT TRACKING

Kazumasa Murata†, Kazuhiro Nakadai‡,†, Kazuyoshi Yoshii*, Ryu Takeda*, Toyotaka Torii‡, Hiroshi G. Okuno*, Yuji Hasegawa‡ and Hiroshi Tsujino‡
† Graduate School of Information Science and Engineering, Tokyo Institute of Technology
‡ Honda Research Institute Japan Co., Ltd.
* Graduate School of Informatics, Kyoto University
murata@cyb.mei.titech.ac.jp, {nakadai, tory, yuji.hasegawa, tsujino}@jp.honda-ri.com, {yoshii,rtakeda,okuno}@kuis.kyoto-u.ac.jp

ABSTRACT
A robot that can provide an active and enjoyable user interface is one of the most challenging applications for music information processing, because the robot must cope with high-power noises, including its own voice and motor noises. This paper proposes noise-robust musical beat tracking using a robot-embedded microphone, and describes its application to a robot singer with music recognition. The proposed beat tracking introduces two key techniques: spectro-temporal pattern matching and echo cancellation. The former realizes robust tempo estimation with a shorter window length, so it can adapt quickly to tempo changes. The latter is effective in cancelling periodic self-noises such as stepping, scatting, and singing. We constructed a robot singer based on the proposed beat tracking for Honda ASIMO. The robot detects the musical beat with its own microphone in a noisy environment and tries to recognize the music based on the detected beat. When it successfully recognizes the music, it sings while stepping according to the beat. Otherwise, it performs scatting instead of singing because the lyrics are unavailable. Experimental results showed fast adaptation to tempo changes and high robustness in beat tracking, even during stepping, scatting and singing.
1 INTRODUCTION
Music information processing has drawn the attention of researchers and industry in recent years. Many techniques in music information processing, such as music information retrieval, are mainly applied to music user interfaces for cellular phones, PDAs and PCs, and various commercial services have been launched [12]. Meanwhile, robots such as humanoid robots have recently become popular. They are expected to help us in daily environments as intelligent physical agents in the future. This means that a robot should not only perform tasks but also offer a more enjoyable experience than PDA- or PC-based interfaces. Music is therefore an important medium for such rich human-robot interaction, because it is one of the most popular human hobbies. This will contribute to the MIR community in the sense that robots provide real-world MIR applications. We therefore started to apply music information processing to robots. As a first step, we focused on musical beat tracking because it is a basic function for recognizing music. However, to apply beat tracking to a robot, three issues should be considered:
1. real-time processing using a robot-embedded microphone,
2. quick adaptation to tempo changes, and
3. high robustness against environmental noises, the robot's own voices and motor noises.
The first issue is crucial for realizing a natural user interface. Many beat-tracking methods have been studied in the field of music information processing [6]. They focus on the extraction of complicated beat structures with off-line processing, although there are some exceptions such as [5, 8]. Nakadai et al. reported the importance of auditory processing using a robot's own ears, and proposed "robot audition" as a new research area [14]. Some robot audition systems that achieve highly noise-robust speech recognition have been reported [7, 18].
However, beat tracking for noisy signals, such as music signals contaminated by robot noise, has not been studied so far. The second issue is essential for real-world applications such as a robot. For example, Goto's algorithm was used in [19]. It was able to cope with real recordings such as CD music and was applied to a software robot dancer called Cindy [3], because it integrates 12 different agents to track musical beats. However, this approach to improving robustness results in insensitivity to tempo changes. This is because an autocorrelation-based method requires a longer window to improve noise robustness, while a short window is necessary to adapt quickly to drastic tempo changes. As a result, they reported that it took around ten seconds to adapt the stepping cycle to tempo changes. Some probabilistic methods have indeed been proposed to cope with tempo changes [10, 2], but these methods tend to require high computational cost and a large amount of memory, which makes them difficult to use in embedded applications. The last issue is similar to the first one in that it is a noise problem. However, when we consider singing, scatting and stepping functions synchronized to musical beats, a new problem arises. The noises caused by such functions are periodic because they are generated according to "periodic" beat signals. If the noises and the beats are synchronized, there will be no problem. However, because scatting and singing are based on estimated beats, entrainment can occur between the real and estimated beats in tempo and phase. Thus, it takes a while for them to attain full synchronization, that is, for the error between the two beats to vanish. This means that the noises degrade the performance of beat tracking. Scatting and singing cause a much bigger problem than stepping, because the loudspeaker embedded in a robot is usually closer to the robot-embedded microphone than the motors and fans are. These noises should be suppressed.

In this paper, we propose a new real-time beat-tracking algorithm that uses two techniques to solve the above three issues. One is spectro-temporal pattern matching to realize faster adaptation to tempo changes. The other is noise cancellation based on semi-blind Independent Component Analysis (semi-blind ICA) [16]. We then developed a robot singer with a music recognition function based on the proposed real-time beat tracking for Honda ASIMO. When music is played, the developed robot first detects its beat, then recognizes the music based on the beat information in order to retrieve the lyrics from a lyrics database, and finally sings while stepping in synchrony with the musical beat. We evaluated the performance of the proposed beat-tracking method in terms of adaptation speed and noise robustness using the developed robot system.

2 RELATED WORK IN ROBOTICS
In robotics, music is a hot research topic [1]. Sony exhibited a singing and dancing robot called QRIO. Kosuge et al. showed that a robot dancer, MS DanceR, performed social dances with a human partner [17]. Nakazawa et al. reported that HRP-2 imitated the spatial trajectories of complex motions of a traditional Japanese folk dance by using a motion capture system [15]. Although these robots performed dances and/or singing, they were programmed in advance and had no listening function.
Some robots have music listening functions. Kotosaka and Schaal [11] developed a robot that plays drum sessions with a human drummer. Michalowski et al. developed a small robot called Keepon, which can move its body quickly according to musical beats [13]. Yoshii et al. developed a beat-tracking robot using Honda ASIMO [19]. This robot was able to detect musical beats by using a real-time beat-tracking algorithm [3], and it was demonstrated timing its steps to the detected beats. These robots worked well only when a music signal was given. It is difficult for them to cope with noises such as environmental noises and their own voices. Thus, they have difficulty with singing and scatting, which produce high-power noises.

3 REAL-TIME BEAT TRACKING ALGORITHM
Figure 1 shows an overview of our newly developed real-time beat-tracking algorithm. This algorithm has two input signals. One is a music signal, which is usually contaminated by noise sources such as self-noises. The other is a self-noise signal such as a scatting or singing voice. Because the self-noise is known to the system in advance, the pure self-noise can be obtained directly from line-in without using a microphone. The outputs are the predicted beat time and the tempo value. The algorithm consists of three stages: frequency analysis, beat interval prediction and beat time prediction.

[Figure 1. Overview of our real-time beat-tracking. Block labels: Musical Audio Signals, Singing Voice Signals, Frequency Analysis, Short Time Fourier Transform, Echo Cancel, Mel-scale Filter Bank, Extracting Onset Components, Onset-Time Vector, Beat Interval Prediction, Pattern Matching, Interval Reliability, Most Reliable Interval, Beat Time Prediction, Cross-Correlation Analysis, Interval, Beat Time.]

3.1 Frequency Analysis
Spectra are obtained consecutively by applying the short-time Fourier transform (STFT) to the two input signals, which are sampled at 44.1 kHz.
A Hanning window of 4,096 points is used as the window function, and its shift length is 512 points. Echo cancellation is then applied. It is essential to eliminate self-noises such as singing and scatting voices to improve beat tracking. For echo cancellation we introduced semi-blind ICA [16], which was proposed by our group for self-voice cancellation. We also extended this method to support multi-channel input signals, and used a two-channel version of semi-blind ICA. One channel takes the spectra contaminated by self-noises as its input, and the other channel takes the pure self-noise as its input. The noise-suppressed spectra are sent to the mel-scale filter bank, which reduces the number of frequency bins from 2,049 linear frequency bins to 64 mel-scale frequency bins in order to reduce the computational cost of later processes. A frequency bin in which the spectral power increases rapidly is detected as an onset candidate in the mel-scale frequency domain. We used the Sobel filter, which is commonly used for visual edge detection, to detect only those frequency bins with a rapid power increase. Let $d_s(t, f)$ be the spectral power at the $t$-th time frame and the $f$-th mel-filter bank bin after Sobel filtering. An onset belief $d(t, f)$ is estimated by

\[
d(t, f) =
\begin{cases}
d_s(t, f) & \text{if } d_s(t, f) > 0, \\
0 & \text{otherwise.}
\end{cases}
\qquad (1)
\]
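To make the onset extraction step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard 3x3 Sobel kernel oriented to differentiate along the time axis, leaves border cells at zero, and uses plain nested lists in place of a real mel spectrogram. The names `sobel_time` and `onset_belief` are hypothetical. The constants at the top only restate the STFT settings given in the text.

```python
# STFT settings quoted in the text.
SAMPLE_RATE = 44100                  # Hz
WINDOW = 4096                        # Hanning window length in samples
SHIFT = 512                          # frame shift in samples
N_LINEAR_BINS = WINDOW // 2 + 1      # 2,049 linear frequency bins
FRAME_PERIOD_MS = 1000.0 * SHIFT / SAMPLE_RATE  # about 11.6 ms per frame


def sobel_time(power):
    """Differentiate power[t][f] along time with a 3x3 Sobel kernel.

    Assumption: standard Sobel weights (1, 2, 1) across neighbouring
    frequency bins, central difference across neighbouring frames.
    Border frames and border bins are left at 0.0 for simplicity.
    """
    n_t = len(power)
    n_f = len(power[0]) if n_t else 0
    out = [[0.0] * n_f for _ in range(n_t)]
    for t in range(1, n_t - 1):
        for f in range(1, n_f - 1):
            out[t][f] = (
                (power[t + 1][f - 1] - power[t - 1][f - 1])
                + 2.0 * (power[t + 1][f] - power[t - 1][f])
                + (power[t + 1][f + 1] - power[t - 1][f + 1])
            )
    return out


def onset_belief(ds):
    """Equation (1): keep positive values of d_s(t, f), zero the rest."""
    return [[v if v > 0 else 0.0 for v in row] for row in ds]


# Toy input: 5 frames x 3 mel bins, power rises at frame 2 and falls at frame 4.
toy_power = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
toy_ds = sobel_time(toy_power)
toy_d = onset_belief(toy_ds)
```

In this toy case the rising edge gives a positive Sobel response at the interior bin, which equation (1) keeps, while the falling edge gives a negative response, which equation (1) sets to zero. This is why only power increases contribute to the onset belief.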