A Survey on Spoken Italian Datasets and Corpora

Marco Giordano; Claudia Rinaldi

Journal Article, Open Access

IEEE Access (2025), vol. 13, pp. 29190-29205

DOI: 10.1109/ACCESS.2025.3538952

Abstract

Spoken Italian datasets are curated collections of audio recordings featuring native Italian speech in various contexts (e.g., spontaneous dialogues, read text, telephone conversations), often accompanied by transcriptions or linguistic annotations. They serve as foundational resources for a wide range of applications, including Automatic Speech Recognition (ASR), Text-To-Speech (TTS) synthesis, emotion detection, and broader linguistic research. Although Italian is a richly diverse Romance language marked by significant dialectal variation, publicly available large-scale corpora remain scarce compared with those of major world languages such as English or Mandarin. In this survey, we present a comprehensive examination of 66 spoken Italian datasets, highlighting their key characteristics, data collection methodologies, and annotation frameworks. We categorize the datasets by speech type (e.g., conversational, monologic, spontaneous), by source (e.g., broadcast media, telephone calls, field recordings), and by demographic or linguistic attributes (including dialects and sociolinguistic features). Our analysis uncovers critical issues around dataset scarcity, demographic underrepresentation, and restricted accessibility, all of which limit broader research and development efforts. To address these gaps, we propose best practices and future directions (such as expanding demographic coverage, promoting open-access models, and standardizing annotation protocols) to enrich Italian speech data resources. The complete dataset inventory is publicly available on GitHub and archived on Zenodo, offering researchers and developers a valuable reference. By highlighting both the achievements and the shortcomings of existing resources, this work aims to foster collaboration and to spur further advances in Italian speech technologies and linguistic research.

Cite

Giordano, M., & Rinaldi, C. (2025). A Survey on Spoken Italian Datasets and Corpora. IEEE Access, 13, 29190–29205. https://doi.org/10.1109/ACCESS.2025.3538952