
Extending the folksonomies of freesound.org using content-based audio analysis

by E Martinez, O Celma, Mohamed Sordo, Bram De Jong, Xavier Serra
Sound and Music Computing Conference (2009)

Abstract

This paper presents an in-depth study of the social tagging mechanisms used in Freesound.org, an online community where users share and browse audio files by means of tags and content-based audio similarity search. We performed two analyses of the sound collection. The first concerns how users tag the sounds; it reveals some well-known problems of collaborative tagging systems (i.e. polysemy, synonymy, and the scarcity of the existing annotations). Moreover, we show that more than 10% of the collection is scarcely annotated, with only one or two tags per sound, which hampers retrieval. The second analysis therefore focuses on enhancing the semantic annotations of these sounds by means of content-based audio similarity (autotagging). To autotag the sounds, we use a k-NN classifier that selects the available tags from the most similar sounds. Human assessment is performed to evaluate the perceived quality of the candidate tags. The results show that, for 77% of the sounds used, the annotations were correctly extended with the proposed tags derived from audio similarity.
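
The autotagging idea summarised in the abstract (propose tags taken from the most similar sounds) can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional feature vectors, the Euclidean distance, the value of k, the `min_votes` threshold and all names (`propose_tags`, `toy_collection`) are assumptions made only for illustration, since the text above does not specify them.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def propose_tags(query_features, query_tags, collection, k=3, min_votes=2):
    """Propose new tags for a sound from its k nearest annotated neighbours.

    collection: list of (features, tags) pairs for already annotated sounds.
    A tag is proposed when at least `min_votes` of the k neighbours carry it
    and the query sound does not have it yet (both thresholds are assumptions).
    """
    neighbours = sorted(
        collection, key=lambda item: euclidean(query_features, item[0])
    )[:k]
    votes = Counter()
    for _, tags in neighbours:
        votes.update(set(tags))  # each neighbour votes once per tag
    existing = set(query_tags)
    candidates = [
        (tag, n) for tag, n in votes.items()
        if n >= min_votes and tag not in existing
    ]
    # Most voted first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [tag for tag, _ in candidates]


# Hypothetical toy data: (feature vector, tags) pairs.
toy_collection = [
    ([0.90, 0.10], ["kick", "drum", "percussion"]),
    ([0.80, 0.20], ["drum", "bass-drum", "percussion"]),
    ([0.85, 0.15], ["drum", "loop"]),
    ([0.10, 0.90], ["bird", "field-recording"]),
    ([0.20, 0.80], ["bird", "nature"]),
]

proposed = propose_tags([0.88, 0.12], ["kick"], toy_collection)
print(proposed)
```

In this toy run the three nearest neighbours are the drum-like sounds, so the scarcely annotated query sound (tagged only with kick) would be offered drum and percussion as candidate tags. In the paper, such candidates are then judged by human assessors.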


EXTENDING THE FOLKSONOMIES OF FREESOUND.ORG USING CONTENT-BASED AUDIO ANALYSIS

Elena Martínez, Òscar Celma, Mohamed Sordo, Bram de Jong, Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
elena.martinez@openbravo.com, ocelma@bmat.com, mohamed.sordo@upf.edu, bdejong@iua.upf.edu, xavier.serra@upf.edu

ABSTRACT

This paper presents an in-depth study of the social tagging mechanisms used in Freesound.org, an online community where users share and browse audio files by means of tags and content-based audio similarity search. We performed two analyses of the sound collection. The first concerns how users tag the sounds; it reveals some well-known problems of collaborative tagging systems (i.e. polysemy, synonymy, and the scarcity of the existing annotations). Moreover, we show that more than 10% of the collection is scarcely annotated, with only one or two tags per sound, which hampers retrieval. The second analysis therefore focuses on enhancing the semantic annotations of these sounds by means of content-based audio similarity (autotagging). To "autotag" the sounds, we use a k-NN classifier that selects the available tags from the most similar sounds. Human assessment is performed to evaluate the perceived quality of the candidate tags. The results show that, for 77% of the sounds used, the annotations were correctly extended with the proposed tags derived from audio similarity.

1 INTRODUCTION

Since 2004, collaborative tagging has seemed a natural way of annotating objects, in contrast to using predefined taxonomies and controlled vocabularies. Internet sites with a strong social component (e.g. last.fm, flickr, and del.icio.us) allow users to tag web objects according to their own criteria. The tagging process can thus improve content organization, navigation, search and retrieval tasks [9].
Nowadays, in the multimedia domain, prosumers play an important role. The term combines producing and consuming: prosumers create and annotate a vast amount of data. Audiovisual assets can be described both manually and automatically. On the one hand, users can organize their music collection using personal tags like late night, while driving, or love. On the other hand, content-based (CB) audio annotation can propose, with some degree of confidence, audio-related tags such as pop, acoustic guitar, or female voice. Both approaches contribute to a rich tag cloud representing the actual content. Still, automatic annotation based solely on CB cannot bridge the Semantic Gap. Hybrid approaches, exploiting both the wisdom of crowds and automatic content description, are needed in order to close the gap.

SMC 2009, July 23-25, Porto, Portugal. Copyrights remain with the authors.

In this sense, Freesound.org, a collaborative sound database, contains both elements: it allows users to annotate sounds, and to browse sounds similar to a given one according to audio similarity. However, some sounds are scarcely annotated, which hampers their retrieval using keyword-based search.

The main goal of this paper is to enhance semantic annotations in the Freesound.org sound collection by means of content-based audio similarity. We propose an approach to "autotag" sounds based on the tags available in their most similar sounds.

2 COLLABORATIVE TAGGING

One of the most interesting aspects of collaborative tagging is that the whole community benefits from sharing information [17]. However, "collective tagging has also the potential to aggravate the problems associated with the fuzziness of linguistic and cognitive boundaries" [7]. Users' contributions produce a huge classification system that consists of an idiosyncratically personal categorization.
Some of the main problems concerning collaborative tagging are polysemy, synonymy and data scarcity. Furthermore, spelling errors, plurals and parts of speech also clearly affect a tagging system.

Sometimes, polysemous tags can return undesirable results. For example, in a music collection, a search for the tag love can return both love songs and songs that users like very much (e.g. a user who loves a Swedish death metal song unrelated to the theme of love).

Tag synonymy is also an interesting problem. Even though it enriches the vocabulary, it also introduces inconsistencies among the terms used in the annotation process. For example, bass drum sounds can be annotated with the kick drum tag, but these sounds will not be returned when searching for bass drum. To avoid this problem, users sometimes add redundant tags to facilitate retrieval (e.g. using synth, synthesis, and synthetic for a given sound excerpt). Yet, there are some approaches to measure semantic relatedness between tags [3]. These metrics could be used to decrease the size of the vocabulary, and also for (automatic) query expansion to increase recall in the sound retrieval task.

Finally, the scarce and unequal nature of a collaborative annotation process, where usually a few sounds are well annotated and the rest contain very few tags, limits the retrieval coverage of a collection.

3 RELATED WORK

In [16], the authors propose a query-by-semantic audio information retrieval system. The proposed system can learn the relationships between acoustic information and words (tags) from a manually annotated audio collection. The learning task is based on a supervised multiclass labeling model, with a multinomial distribution of words over a predefined vocabulary.

Torres et al. propose a method to construct a musically meaningful vocabulary [15]. By means of acoustic correlation using sparse canonical correlation analysis (sparse CCA), they can remove from the vocabulary those noisy words (not related to the actual audio content) that have been inconsistently used by human annotators.

The bag-of-frames (BOF) approach has been extensively used to describe timbral properties of an audio signal. This approach is used to extract mid-level descriptions from music signals, such as their genre or instrument, but it is also used to compute timbre similarity between songs. In [1], the authors find that this approach tends to generate false positives: songs that are irrelevantly close to many other songs.
These songs are called hubs, and the authors propose measures to quantify the "hubness" of a given song. This property affects any system that uses timbral features to compute content-based audio similarity.

Cano has studied the strengths and limitations of audio fingerprinting, and suggests that it can be extended to allow content-based similarity search, such as finding similar sounds using query-by-example [2]. Similarly to our approach, [14] proposes a non-parametric strategy for automatically tagging songs, using content-based audio similarity to propagate tags from annotated songs to similar, non-annotated songs.

In [5], the authors present a method to recommend tags for unlabeled songs. Automatic tags are computed by means of a set of boosted classifiers (AdaBoost), in order to provide tags for tracks that are poorly annotated or not annotated at all. This method allows music recommenders to include in a playlist unheard music that otherwise would be missed, enhancing the novelty component of the recommendations.

Figure 1. A linear-log plot depicting the number of tags per sound. Most of the sounds are annotated using 3–5 tags, and only a few sounds are annotated with more than 40 tags.

4 THE FREESOUND.ORG COLLECTION

Freesound.org is a collaborative sound database where, since 2005, people from different disciplines have shared recorded sounds and samples under the Creative Commons license. The initial goal was to support sound researchers, who often have trouble finding large sound databases to test their algorithms. Four years after its inception, Freesound.org serves more than 23,000 unique visits per day. There is also an engaged community, with almost a million registered users, accessing more than 66,000 uploaded sounds. Yet only a few dozen users have uploaded hundreds of sounds, whilst the rest uploaded just a few.
In fact, 80% of the users uploaded fewer than 20 sounds, and only 8 users uploaded more than one thousand sounds each. It is worth noting that these few users can strongly influence the overall sound annotation process.

4.1 Tag behaviour

In this section we provide some insights into tag behaviour and user activity in the Freesound.org community. We are interested in analyzing how users tag sound assets, as well as the concepts used when tagging. The data, collected during March 2009, consists of around 66,000 sounds annotated with 18,500 different tags.

Figure 1 shows the number of tags used to annotate the audio samples. The x-axis represents the number of tags used per sound. We can see that most of the sounds are annotated using 3–5 tags. Also, around 7,500 sounds are insufficiently annotated, using only 1 or 2 tags. These sounds represent
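
The tags-per-sound analysis of Section 4.1 can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (a dictionary from sound identifier to tag list), the function names and the toy annotations are hypothetical and not taken from the Freesound.org database. The last lines merely redo the arithmetic on the rounded figures quoted in the text (about 7,500 of about 66,000 sounds), which agrees with the "more than 10%" stated in the abstract.

```python
from collections import Counter


def tags_per_sound_histogram(annotations):
    """Map each tag count to the number of sounds having that many distinct tags.

    annotations: dict of sound_id -> list of tags (hypothetical layout).
    """
    return Counter(len(set(tags)) for tags in annotations.values())


def scarce_share(annotations, max_tags=2):
    """Fraction of sounds annotated with between 1 and `max_tags` distinct tags."""
    scarce = sum(
        1 for tags in annotations.values() if 1 <= len(set(tags)) <= max_tags
    )
    return scarce / len(annotations)


# Hypothetical toy annotations, for illustration only.
toy_annotations = {
    "s1": ["rain"],
    "s2": ["door", "slam"],
    "s3": ["synth", "pad", "ambient"],
    "s4": ["bird", "forest", "field-recording", "morning"],
}

histogram = tags_per_sound_histogram(toy_annotations)
toy_share = scarce_share(toy_annotations)

# Arithmetic on the rounded figures quoted in the text.
reported_share = 7500 / 66000
print(f"{reported_share:.1%}")
```

Plotting `histogram` with a logarithmic y-axis would give a plot of the kind described in the caption of Figure 1.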

