Automatically assigning a group of appropriate semantic tags to a music piece provides an effective way for people to efficiently utilize the massive and ever-increasing volume of online and offline music data. In this paper, we propose a novel content-based automatic music annotation model that hierarchically combines attentive convolutional networks and recurrent networks for music representation learning, structure modelling and tag prediction. The model first exploits two separate attentive convolutional networks composed of multiple gated linear units (GLUs) to learn effective representations from both the 1-D raw waveform signal and the 2-D Mel-spectrogram of the music, capturing informative features for the annotation task better than any single representation channel. The model then exploits bidirectional Long Short-Term Memory (LSTM) networks to model the time-varying structures embedded in the description sequences of the music, and further introduces a dual-state LSTM network to encode temporal correlations between the two representation channels, which effectively enriches the descriptions of the music. Finally, the model adaptively aggregates the music descriptions generated at every time step with a self-attentive multi-weighting mechanism for music tag prediction. The proposed model achieves state-of-the-art results on the public MagnaTagATune music dataset, demonstrating its effectiveness for music annotation.
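As a minimal illustrative sketch of two operations the abstract names, the snippet below implements a single gated linear unit (GLU) and a self-attentive weighted aggregation over per-time-step descriptions in NumPy. All shapes, names, and the single scoring vector are assumptions for illustration; the paper's actual networks stack many such units inside convolutional and LSTM layers, and its multi-weighting mechanism may use several attention heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def glu(x, W, V, b, c):
    """Gated linear unit: a linear projection elementwise-gated by a
    sigmoid of a second projection, (xW + b) * sigmoid(xV + c)."""
    return (x @ W + b) * (1.0 / (1.0 + np.exp(-(x @ V + c))))

def self_attentive_pool(H, w):
    """Aggregate per-time-step descriptions H (T x d) into one vector,
    weighting each step by a softmax over scores H @ w (w is a learned
    scoring vector in a trained model; random here)."""
    scores = H @ w                        # one score per time step, (T,)
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()                  # attention weights sum to 1
    return alpha @ H                      # weighted sum over time, (d,)

# Illustrative shapes: T time steps, d_in input dims, d hidden dims.
T, d_in, d = 8, 16, 32
x = rng.standard_normal((T, d_in))
W, V = rng.standard_normal((d_in, d)), rng.standard_normal((d_in, d))
b, c = np.zeros(d), np.zeros(d)
H = glu(x, W, V, b, c)      # stand-in for the LSTM description sequence
summary = self_attentive_pool(H, rng.standard_normal(d))
print(summary.shape)        # (32,)
```

In a full model, the pooled summary vector would feed a final classification layer producing one score per tag.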
CITATION STYLE
Wang, Q., Su, F., & Wang, Y. (2019). A hierarchical attentive deep neural network model for semantic music annotation integrating multiple music representations. In ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (pp. 150–158). Association for Computing Machinery, Inc. https://doi.org/10.1145/3323873.3325031