Expressive Concatenative Synthesis by Reusing Samples from Real Performance Recordings
Computer Music Journal (2009)
- ISSN: 01489267
- DOI: 10.1162/comj.2009.33.4.23
Abstract
Here we describe an approach to the expressive synthesis of jazz saxophone melodies that reuses audio recordings and carefully concatenates note samples. The aim is to generate an expressive audio sequence from the analysis of an arbitrary input score using a previously induced performance model and an annotated saxophone note database extracted from real performances. We push the idea of using the same corpus for both inducing an expressive performance model and synthesizing sound by concatenating samples in the corpus. Therefore, a connection between the performers' instrument sound and performance characteristics is kept during the synthesis process.
Esteban Maestre, Rafael Ramírez, Stefan Kersten, and Xavier Serra
Music Technology Group, Universitat Pompeu Fabra
122-140 Tanger, Barcelona 08018, Spain
{esteban.maestre, rafael.ramirez, stefan.kersten, xavier.serra}@upf.edu
Computer Music Journal, 33:4, pp. 23-42, Winter 2009

The manipulation of sound properties such as timing, amplitude, timbre, and pitch by different performers and styles is an important factor to take into account when approaching instrumental sound synthesis. Expressive music performance studies the manipulation of such sound properties in an attempt to understand expression, so that it can be applied to sound synthesis for obtaining expressive instrumental sound in the shape of a synthetic performance.

During the past few years, the availability of technology for high-fidelity sound synthesis based on samples has pushed the consolidation of sample-based concatenative synthesizers as the most popular and flexible means of reconstructing the sound of traditional musical instruments (Schwarz 2006). Recent implementations (Bonada and Serra 2007; Lindemann 2007) have yielded high-quality sound synthesis and often offer a wide range of synthesis parameters, including some related to expression, normally concerning either a note or the transition between two successive notes. However, in most cases these parameters must be tuned manually, which is extremely time consuming and requires considerable effort and knowledge from the user. Ideally, expression-related parameters should be tuned automatically by the synthesis system by applying some prior knowledge about the expressive transformations a particular musician introduces when performing a piece.

In the past, such knowledge has traditionally been obtained by empirically studying real expressive performance recordings (e.g., Repp 1992; Todd 1992; Friberg et al. 1998), and more recently, by applying machine-learning techniques (e.g., Widmer 2001; Lopez de Mantaras and Arcos 2002; Ramirez, Hazan, and Maestre 2006a, 2006b). Machine-learning approaches to expressive-performance modeling rest on a symbolic representation to which these techniques can be applied. This symbolic representation can be easier to obtain, as is the case for excitation-instantaneous musical instruments (e.g., piano), or more difficult to obtain, as is the case for excitation-continuous musical instruments (e.g., wind or bowed-string instruments). In excitation-continuous instruments, both excitation and control of the sound-production mechanisms are achieved by continuous modulations; thus, the extraction of symbolic-level information requires the analysis of the recorded audio stream instead of measuring note durations or dynamics from MIDI-like representations.

Here we describe an approach to the expressive synthesis of jazz saxophone melodies that reuses audio recordings and carefully concatenates note samples. The aim is to generate an expressive audio sequence from the analysis of an arbitrary input score using a previously induced performance model and an annotated saxophone note database extracted from real performances. We push the idea of using the same corpus for both inducing an expressive performance model and synthesizing sound by concatenating samples in the corpus. Therefore, a connection between the performers' instrument sound and performance characteristics is kept during the synthesis process.

The architecture of our system, depicted in Figure 1, can be briefly summarized as follows.
First, given a set of expressive performance recordings, we obtain a description of the audio by carrying out segmentation and characterization at different temporal levels (note, intra-note, note-to-note transition) and build an annotated database of pre-analyzed note segments for later use in the synthesis stage. A performance model is trained using inductive logic-programming techniques by matching the score to the description of the performances obtained while constructing the database. For synthesizing expressive audio, the input score is first analyzed, and a set of descriptors is extracted. From such a description, the performance model obtains an enriched score including expression-related parameters. Finally, by considering the enriched score, the most suitable note samples from the database are retrieved, transformed, and concatenated.

Figure 1. Schematic view of the system architecture.

This article presents an extended description of an off-line audio analysis/synthesis application based on previous work (Maestre et al. 2006; Ramirez, Hazan, and Maestre 2006b; Ramirez et al. 2007). The rest of the article is organized as follows. The next section describes related work, from sample-based concatenative synthesis to expressive performance modeling. The following sections present the audio analysis carried out to annotate our database of performance recordings, and the details of the database-construction process. Next, we describe how the expressive performance model is built from the database annotations. Then, we present the audio synthesis methods, giving special emphasis to the sample search. Finally, we present some conclusions and outline future work.

Related Work

Sample-Based Concatenative Synthesis

Sample-based concatenative synthesis is an emerging approach to sound generation based on concatenating short audio excerpts (samples) from a database to achieve a desired sonic result given a target description (e.g., a score) or sound (Schwarz 2000). Although sampling cannot be strictly considered as a sound-synthesis technique, it provides, in terms of sound quality and realism, one of the most successful approaches for reproducing real-world musical sound (Bonada and Serra 2007).
The main reason is that the naturalness of sounds is maintained, because the audio slices used for concatenation are actual samples collected from realistic contexts to which just some meaningful sound transformations need to be applied, both to smooth concatenations and to match the input specification given ad hoc distance metrics. Moreover, for greater database sizes, it is more probable that a closely matching sample will be found, so the need to apply transformations is reduced (Schwarz 2006).

The samples can be non-uniform, i.e., they can comprise any duration from a sound snippet, through an entire instrumental note, up to a whole phrase. Even though it is customary to consider homogeneous sizes and types of samples, and sometimes a sample is just a short time window of the signal used in conjunction with some spectral analyses and overlap-add synthesis (Schwarz 2007), we approach the synthesis of melodies by concatenating note samples, each one corresponding to an entire performed note of arbitrary duration.

Apart from the transformations to be applied to the retrieved samples, which might end up resulting in a degradation of the sound quality when the target features and the retrieved sample are far apart given a particular distance metric, the way in which the most convenient sequence of samples is selected from the database is important when trying to maintain the feeling of sound continuity. This issue has been treated from an optimization perspective in general-purpose, concatenative-synthesis applications for music and speech, where not just the descriptions of the samples are considered, but also their context (Hunt and Black 1996; Aucouturier and Pachet 2006). In our work, we have similarly placed emphasis on respecting the sample's original context during the retrieval stage.

Concatenative sound synthesis (CSS) has been used and studied for some time, with its first applications found in the early text-to-speech (TTS) synthesis systems, which transform input text into speech sound signals (Klatt 1983; Prudon 2003). Although speech synthesis and music synthesis have different objectives (intelligibility and naturalness vs. expressivity and musical flexibility), similar principles can be found in speech synthesis and musical sound synthesis, and thus important parts of the methodology have traditionally been shared (Sagisaka 1988; Beller et al. 2005). Although used for strictly musical purposes in many different ways, only recently has sample-based CSS been formally defined in a purely musical context.

According to Schwarz (2007), one of the main applications of corpus-based CSS is high-level instrument synthesis, where natural-sounding transitions can be synthesized by selecting samples from matching contexts. This is a particularly challenging issue for the case of excitation-continuous instruments (e.g., wind instruments). Some relevant implementations have appeared recently, from which we will briefly review those that proved most inspiring for, or most closely related to, the work presented in this article. For a comprehensive review of CSS, we refer the reader to Schwarz (2006).

One of the most important and broad contributions to the topic of CSS is Schwarz's PhD dissertation (2004).
In addition to formally defining several important aspects involved and unifying concepts, this work introduces a general-purpose, corpus-based system based on data-driven unit selection. In his general framework, the target specification is obtained from either a symbolic score or audio analysis as a sequence of descriptor values. In our case, we introduce an expressivity component when constructing an enhanced symbolic score, generated as an enrichment of an input musical score, by means of performance knowledge induced from the database itself. In our system, selection of the best sample sequence is accomplished by distance functions and a path-search sample-selection algorithm, including some constraint-satisfaction techniques. One of the extensions that we introduce is that the knowledge of our expressive-performance modeling component has been induced from the synthesis database itself, and therefore there is a strong connection between the expressivity and synthesis modules of our system. Thus, we could make the "corpus-based" term also cover the induced expressive performance model.

Staying on the musical side but closer to speech, we find the singing-voice synthesizer developed by Bonada and Loscos (2003) and Bonada and Serra (2007). This system, developed over several years, has become the most successful commercial singing-voice synthesizer: Yamaha's Vocaloid (www.vocaloid.com). The system, based on phase-vocoder techniques and spectral concatenation, searches for the most convenient sequence of diphonemes (samples) in an annotated database of singing-voice excerpts, recorded at different tempi and dynamics, to render a virtual performance out of the lyrics and an input score. Although based on complex articulation-oriented concatenation constraints, sample selection relies on a full search of sample candidates, examining the context of two score notes.
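As a rough illustration of what a path-search selection over distance functions can look like, the following sketch uses a Viterbi-style dynamic program in the spirit of unit selection (Hunt and Black 1996). It is a minimal sketch under stated assumptions, not the algorithm of the system described in this article: the descriptors (pitch and duration only), the cost weights, and all names (Note, Sample, target_cost, concat_cost, select_samples) are hypothetical, and no constraint-satisfaction step is included.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Note:
    """A target note taken from the (enriched) score. Fields are assumed."""
    pitch: float     # MIDI note number
    duration: float  # seconds


@dataclass(frozen=True)
class Sample:
    """An annotated note sample from the database. Fields are assumed."""
    pitch: float
    duration: float
    phrase_id: int   # recording the sample was cut from
    index: int       # position of the note inside that recording


def target_cost(target: Note, sample: Sample) -> float:
    # Assumed distance: one semitone costs as much as one second of mismatch.
    return abs(target.pitch - sample.pitch) + abs(target.duration - sample.duration)


def concat_cost(prev: Sample, cur: Sample) -> float:
    # Samples that were adjacent in the original recording join for free,
    # which rewards keeping a sample in its original context.
    if prev.phrase_id == cur.phrase_id and cur.index == prev.index + 1:
        return 0.0
    return 1.0  # assumed flat penalty for any other join


def select_samples(targets: Sequence[Note], database: Sequence[Sample]) -> List[Sample]:
    """Return the sample sequence minimizing summed target and join costs."""
    if not targets or not database:
        return []
    n = len(database)
    # cost[j]: cheapest total cost of any path ending in database[j].
    cost = [target_cost(targets[0], s) for s in database]
    back: List[List[int]] = []  # back[t][j]: best predecessor of j at step t + 1
    for target in targets[1:]:
        new_cost, pointers = [], []
        for cur in database:
            best = min(range(n), key=lambda j: cost[j] + concat_cost(database[j], cur))
            pointers.append(best)
            new_cost.append(
                cost[best] + concat_cost(database[best], cur) + target_cost(target, cur)
            )
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the best final sample.
    j = min(range(n), key=lambda k: cost[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [database[k] for k in path]


if __name__ == "__main__":
    a = Sample(60, 0.5, phrase_id=0, index=0)
    b = Sample(62, 0.6, phrase_id=0, index=1)  # follows a in the recording
    c = Sample(62, 0.5, phrase_id=1, index=5)  # exact match, foreign context
    # The contiguous pair (a, b) is chosen over the exact but out-of-context c.
    print(select_samples([Note(60, 0.5), Note(62, 0.5)], [a, b, c]))
```

With these assumed weights, a slightly mismatched sample that preserves its original neighbor is preferred over an exact match taken from an unrelated context. The search visits every pair of candidates per note, so its cost grows with the square of the database size; a practical system would prune candidates first.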
Traits of the original voice and articulation characteristics are impressively retained after transformations, owing to a refined source-filter spectral model. However, the expressive possibilities are limited to manual editing of some pitch and dynamics curves, or adding pre-defined transformation templates for including expressive resources. In this article, we use explicit expressivity knowledge induced from the synthesis corpus, and we later automatically apply it when selecting and transforming samples.

The approach introduced by Lindemann (2007), referred to as reconstructive phrase modeling (RPM), achieves musical expressivity through a blend of functional additive synthesis and phrase-oriented parametric concatenative synthesis that can be used both off-line from a score and in real time from standard MIDI performance controls. This approach




