Abstract
A common scenario for speaker identification is the task of attributing samples of speech to members of a previously enrolled group. One recent (and typical) set of results used 36 seconds of speech from each speaker to train Gaussian models by expectation-maximization during enrollment, and 20 seconds of speech for each test sample. Three major problems with this procedure are 1) sensitivity to noise; 2) the impractical amount of speech required; 3) the computationally expensive training required. In our study, the reassigned spectrogram is pruned using phase-derivative indicator functions to provide a sparse time-frequency matrix of very short (40 ms) samples of speech. The pruning eliminates Gaussian noise at signal-to-noise ratios (SNR) down to at least 6 dB. Principal component analysis provided a set of 30 features from each spectrogram. Using an enrolled group of 24 speakers recorded under low-fidelity conditions, 83% identification accuracy (comparable to state-of-the-art results at 6 dB SNR) was achieved with real-time classification methods (e.g. support vector machines) without the need for extensive training. Moreover, these results extend to SNRs below 6 dB, where standard techniques break down. The three main problems of speaker identification can thus be better addressed by our methodology. © 2013 Acoustical Society of America.
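To make the quantities in the abstract concrete, the following minimal sketch shows what a 40 ms sample and a 6 dB SNR amount to numerically. It is not the authors' method: the 16 kHz sample rate, the 220 Hz test tone, and the helper names (mean_power, add_gaussian_noise, measured_snr_db) are assumptions introduced for illustration only, and no reassignment or pruning is performed.

```python
import math
import random

def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sequence."""
    return sum(x * x for x in samples) / len(samples)

def add_gaussian_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled so the target SNR (in dB) is met on average."""
    rng = random.Random(seed)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    noise = [rng.gauss(0.0, sigma) for _ in signal]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise

def measured_snr_db(signal, noise):
    """SNR in dB computed from the clean signal and the noise that was added."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

SAMPLE_RATE_HZ = 16000   # assumption: the abstract does not state a sample rate
FRAME_SECONDS = 0.040    # the 40 ms sample length from the abstract
frame_len = int(round(SAMPLE_RATE_HZ * FRAME_SECONDS))

# Hypothetical stand-in for speech: a 220 Hz sine tone lasting one 40 ms frame.
frame = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE_HZ) for t in range(frame_len)]
noisy, noise = add_gaussian_noise(frame, snr_db=6.0)

print(frame_len, "samples per 40 ms frame")
print("measured SNR (dB):", round(measured_snr_db(frame, noise), 2))
```

At 6 dB SNR the signal power is about four times the noise power (10^0.6 ≈ 3.98), so the noise is far from negligible. The measured value fluctuates around 6 dB because a 40 ms frame contains only a few hundred noise samples under the assumed sample rate.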
Fulop, S. A., & Kim, Y. (2013). Speaker identification made easy with pruned reassigned spectrograms. In Proceedings of Meetings on Acoustics (Vol. 19). https://doi.org/10.1121/1.4798949