Octava, M.Q.H., Pratama, A.R. and Alfarozi, S.A.I., 2025, September. Fine-Tuning Whisper for Domain-Specific ASR: Transcribing Indonesian YouTube Content on Local Wisdom in Disaster Mitigation. In 2025 12th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA) (pp. 1-6). IEEE.
In Indonesia, YouTube is a prominent platform for sharing videos about life stories and insights. This platform holds significant potential for data mining applications, especially for extracting disaster mitigation insights rooted in local wisdom. However, converting video content into text remains challenging, as existing ASR systems struggle with Indonesian regional dialects, contextual terms, and the audio disturbances typical of YouTube videos. To address these limitations, this study fine-tuned the pre-trained Whisper-small model on a domain-specific dataset derived from YouTube videos. Audio segments were extracted based on subtitle timestamps, with the subtitle text serving as the transcription labels. To ensure data quality, the labels were validated by comparing the subtitle annotations with zero-shot transcriptions generated by Whisper-large-v3. Augmentation techniques, such as generating clean vocal versions, were applied to improve audio clarity. Additionally, secondary datasets were included to maintain the model's flexibility in general transcription scenarios. The experimental results show that the fine-tuned Whisper-small model significantly outperformed the original, reducing the word error rate (WER) from 41.04% to 13.10% on domain-specific test data. These findings suggest that fine-tuning Whisper on targeted, domain-specific data that reflects the acoustic characteristics of the target content can greatly improve transcription accuracy.
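
As an illustration of the metric and of the label-validation step described above, the following is a minimal sketch, not the authors' implementation. It computes WER as the word-level edit distance divided by the reference length, and keeps a segment only when its subtitle label agrees closely enough with a zero-shot transcription. The abstract does not state which comparison measure or cutoff was used for validation, so the use of WER for that comparison, the function names `word_error_rate` and `filter_segments`, and the `max_wer` threshold are all assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def filter_segments(segments, max_wer=0.5):
    """Keep (subtitle, zero_shot) pairs whose texts agree closely enough.

    `max_wer` is a hypothetical cutoff chosen for illustration only;
    the source does not report a validation threshold.
    """
    return [
        (subtitle, zero_shot)
        for subtitle, zero_shot in segments
        if subtitle.split() and word_error_rate(subtitle, zero_shot) <= max_wer
    ]
```

Under this definition, the reported drop from 41.04% to 13.10% corresponds to a relative WER reduction of roughly 68%. In practice, text normalization (punctuation, casing, number formatting) applied before scoring can change WER noticeably, and this sketch only lowercases and splits on whitespace.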