Speaker diarization refers to methods for identifying speakers from audio recordings. An important application is assessing student interactions in collaborative learning environments. Diarization is very difficult in these environments, where a single microphone records multiple voices. Despite advances in the field, little progress has been made for noisy, disorganized multi-speaker environments. Current state-of-the-art methods based on deep learning require large training databases and can suffer from significant noise interference and from bias due to the speaker's accent, age, and gender. In this paper, we propose a new speaker identification method that does not require large training sets. To this end, we use a virtual array of microphones. The signals at the virtual microphones are simulated by extracting the spatial information of the speakers from a single-channel audio recording, using approximate speaker geometry observed from a video recording. The Room Impulse Responses (RIRs) at the virtual microphones are then estimated using acoustic scene simulations and used to compute a cross-correlation matrix of possible audio sources. Speaker diarization is performed using the cross-correlation matrices as input to a classifier. For the task of identifying active student speakers in classroom audio, the proposed method significantly outperformed the diarization services of Google Cloud and Amazon AWS.
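As a rough illustration of the virtual-microphone idea, and not the authors' implementation, the sketch below simulates a single-channel signal at a few virtual microphones and forms a normalized zero-lag cross-correlation matrix that a classifier could take as input. Everything specific here is an assumption made for illustration: the hypothetical function names, the 2-D geometry, the sampling rate, and above all the free-field impulse response (a single delayed, attenuated spike), which stands in for the acoustic scene simulation described in the abstract.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def free_field_rir(source, mic, fs, length):
    """Toy impulse response: direct path only (one delayed, attenuated spike).

    Assumption: a stand-in for a simulated RIR; no reflections are modeled.
    """
    dist = math.dist(source, mic)
    delay = int(round(dist / SPEED_OF_SOUND * fs))
    h = [0.0] * length
    if delay < length:
        h[delay] = 1.0 / max(dist, 1e-3)  # 1/r attenuation
    return h


def convolve(x, h):
    """Plain full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            if hj != 0.0:
                y[i + j] += xi * hj
    return y


def xcorr_matrix(signals):
    """Normalized zero-lag cross-correlation between every pair of signals."""
    norms = [math.sqrt(sum(v * v for v in s)) or 1.0 for s in signals]
    n = len(signals)
    return [
        [
            sum(a * b for a, b in zip(signals[i], signals[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def virtual_mic_features(mono, source, mics, fs, rir_len=128):
    """Simulate the mono signal at each virtual mic, then correlate the results."""
    simulated = [convolve(mono, free_field_rir(source, m, fs, rir_len)) for m in mics]
    return xcorr_matrix(simulated)


# Hypothetical setup: a 440 Hz tone, one candidate speaker position, three virtual mics.
fs = 8000
mono = [math.sin(2 * math.pi * 440 * t / fs) for t in range(800)]
mics = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
C = virtual_mic_features(mono, source=(2.0, 1.0), mics=mics, fs=fs)
```

In this toy setting the off-diagonal entries of `C` depend on the path-length differences between the candidate position and each virtual microphone, so different candidate positions give different matrices. How the paper actually builds and classifies its cross-correlation matrices is not specified in the abstract.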
Gomez, A., Pattichis, M. S., & Celedon-Pattichis, S. (2022). Speaker Diarization and Identification From Single Channel Classroom Audio Recordings Using Virtual Microphones. IEEE Access, 10, 56256–56266. https://doi.org/10.1109/ACCESS.2022.3177584