A Survey on Spoken Italian Datasets and Corpora

This article is free to access.

Abstract

Spoken Italian datasets are curated collections of audio recordings featuring native Italian speech in various contexts (e.g., spontaneous dialogues, read text, telephone conversations), often accompanied by transcriptions or linguistic annotations. They serve as foundational resources for a wide range of applications, including Automatic Speech Recognition (ASR), Text-To-Speech (TTS) synthesis, emotion detection, and broader linguistic research. Despite Italian's status as a richly diverse Romance language marked by significant dialectal variation, publicly available large-scale corpora have remained underrepresented compared with those of major world languages such as English or Mandarin. In this survey, we present a comprehensive examination of 66 spoken Italian datasets, highlighting their key characteristics, data collection methodologies, and annotation frameworks. We categorize the datasets by speech type (e.g., conversational, monologic, spontaneous), by source (e.g., broadcast media, telephone calls, field recordings), and by demographic or linguistic attributes (including dialects and sociolinguistic features). Our analysis uncovers critical issues around dataset scarcity, demographic underrepresentation, and restricted accessibility, all of which limit broader research and development efforts. To address these gaps, we propose best practices and future directions, such as expanding demographic coverage, promoting open-access models, and standardizing annotation protocols, to enrich Italian speech data resources. The complete dataset inventory is publicly available on GitHub and archived on Zenodo, offering researchers and developers a valuable reference. By highlighting both the achievements and the shortcomings in existing resources, this work ultimately aims to foster collaboration and to spur further advancements in Italian speech technologies and linguistic research.

Cite

Giordano, M., & Rinaldi, C. (2025). A Survey on Spoken Italian Datasets and Corpora. IEEE Access, 13, 29190–29205. https://doi.org/10.1109/ACCESS.2025.3538952
