# RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects

## Dataset Description

The **RegSpeech12** dataset is a large-scale **spontaneous Bengali speech corpus** collected from **12 regions of Bangladesh**, designed for benchmarking **Automatic Speech Recognition (ASR)** under **dialectal variation**. Unlike existing Bengali resources, RegSpeech12 specifically captures **phonological, lexical, and morphological diversity** across regions such as **Rangpur, Sylhet, Chittagong, Noakhali, Narail, Kishoreganj, Barishal, Habiganj, Comilla, Tangail, Sandwip, and Narsingdi**.

This dataset was created to address the gap in **regional dialect resources**, enabling the development of **robust ASR systems** that work across dialects, accents, and speech styles.

---

## Dataset Summary

* **Total Audio Segments:** 21,313
* **Total Duration:** \~100 hours
* **Speakers:** 394+
  * \~53% male, \~38% female, \~9% mixed-gender
* **Regions Covered:** 12 districts, 99 subregions
* **Average Segment Length:** 16.9 sec
* **Vocabulary Size:** 58,971 unique words
* **Split:**
  * Train: 17,049 clips (\~80 hours)
  * Validation: 2,132 clips (\~10 hours)
  * Test: 2,132 clips (\~10 hours)

---

## Dialectal Coverage

The dataset contains recordings across diverse Bengali dialects, with concrete examples of **regional phonological and lexical variations**. Some sample regions:

* **Sylhet:** Distinct intonation and loanwords.
* **Rangpur:** Lexical and morphological divergence.
* **Chittagong:** Unique vocabulary and strong prosodic shifts.
* **Barishal, Noakhali, Narail, Habiganj, Comilla, Tangail, Sandwip, Kishoreganj, Narsingdi** also included.

---

## Data Collection & Validation

* **Collection Protocols:** Spontaneous monologues, free speech recordings, \~10 min per participant.
* **Diversification Dimensions:**
  * Gender balance
  * Age variation
  * Topical coverage (64 topics, 13 categories including education, politics, sports, family, economics)
  * Geographical diversity across 99 subregions
* **Transcription:** Annotated using **Labelbox** by local transcribers familiar with dialects.
* **Validation:** Linguist-reviewed with multiple correction cycles.
* **Split Strategy:** 80:10:10 train/val/test with balanced sampling.

---

## File Structure

* **audio/** → Contains `.wav` files of speech clips
* **metadata.csv** → Includes file name, region and transcription
* **train/val/test splits** → Predefined partitioning for ASR benchmarking

---

## Applications

* **Automatic Speech Recognition (ASR)** under dialectal variation
* **Dialectology & Sociolinguistics** research
* **Speech synthesis (TTS)** for regional Bengali
* **Voice assistants, educational tools, accessibility apps** for Bengali speakers

---

## Limitations

* Environmental noise due to naturalistic recordings (phones, open spaces).
* Regional lexical variation may reduce intelligibility for standard Bengali ASR models.
* Not suitable for tasks requiring **studio-quality speech**.

---

## Relevant Links

[Paper Link](https://arxiv.org/abs/2510.24096) | [PX12 Weights](https://huggingface.co/Rezuwan/regional_asr_weights) | [Dataset Link](https://www.kaggle.com/datasets/mdrezuwanhassan/regspeech12)

---

## Citation

If you use this dataset, please cite:

```bibtex
@misc{hassan2025regspeech12regionalcorpusbengali,
      title={RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects},
      author={Md. Rezuwan Hassan and Azmol Hossain and Kanij Fatema and Rubayet Sabbir Faruque and Tanmoy Shome and Ruwad Naswan and Trina Chakraborty and Md. Foriduzzaman Zihad and Tawsif Tashwar Dipto and Nazia Tasnim and Nazmuddoha Ansary and Md. Mehedi Hasan Shawon and Ahmed Imtiaz Humayun and Md. Golam Rabiul Alam and Farig Sadeque and Asif Sushmit},
      year={2025},
      eprint={2510.24096},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.24096},
}
```

---

## ✉️ Contact

* **Md. Rezuwan Hassan** – [md.rezuwan.hassan@g.bracu.ac.bd](mailto:md.rezuwan.hassan@g.bracu.ac.bd)
* **Farig Sadeque** – [farig.sadeque@bracu.ac.bd](mailto:farig.sadeque@bracu.ac.bd)
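
---

## Example: Reading the Metadata

The layout described under File Structure (an `audio/` directory of `.wav` clips plus a `metadata.csv` holding file name, region and transcription) could be read along the following lines. This is a minimal sketch, not an official loader: the column names `file_name`, `region` and `transcription` are assumptions, as is the idea that the file-name column already includes the `.wav` extension. Check the header row of the real `metadata.csv` and adjust the constants before use.

```python
import csv
from collections import Counter
from pathlib import Path

# Assumed column names (hypothetical); verify against the real metadata.csv header.
FILE_COL = "file_name"
REGION_COL = "region"
TEXT_COL = "transcription"


def read_metadata(lines):
    """Parse CSV lines (an open file or any iterable of strings) into dict rows."""
    return list(csv.DictReader(lines))


def clips_per_region(rows):
    """Count how many clips each region contributes."""
    return Counter(row[REGION_COL] for row in rows)


def audio_path(row, audio_dir="audio"):
    """Build the path of a clip under audio/, assuming FILE_COL holds the file name."""
    return Path(audio_dir) / row[FILE_COL]


def summarize(metadata_file="metadata.csv"):
    """Print the per-region clip counts for a metadata file on disk."""
    with open(metadata_file, newline="", encoding="utf-8") as handle:
        rows = read_metadata(handle)
    for region, count in clips_per_region(rows).most_common():
        print(f"{region}: {count} clips")
```

Calling `summarize()` from the dataset root would list the regions by clip count, which is a quick way to confirm that all 12 regions are present in a downloaded copy. Opening the file with `encoding="utf-8"` matters here because the transcriptions are in Bengali script.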