Creating dataset with audio and its captions

Callum_Matthews · October 4, 2021, 1:42pm

I have audio files (in .wav) and their corresponding captions. I’m slowly researching and creating a model to transcribe the audio into text. The audio files are mostly short, averaging 12 seconds in duration. But how would I do that? Is there a way to create a custom TF dataset that holds the audio in one column and its caption in another?

lgusm · October 5, 2021, 10:44am

I’d format my dataset like existing ones that have a similar objective, such as librispeech: librispeech | TensorFlow Datasets

That will help you train a model later, as there are already many examples based on the librispeech dataset.
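As a rough illustration of that idea, here is a minimal sketch that pairs each .wav file with its caption and yields examples under the "speech" and "text" keys that librispeech uses. It rests on several assumptions that are not from this thread: a hypothetical metadata.csv manifest with file_name and caption columns, 16-bit mono PCM audio, and the helper names read_wav_int16, iter_examples, and to_tf_dataset, which are made up for this example. The TensorFlow import is kept inside to_tf_dataset so the file-reading part can be tried on its own.

```python
import csv
import os
import struct
import wave


def read_wav_int16(path):
    """Read a 16-bit mono PCM .wav file; return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        # Assumption: every file is 16-bit mono PCM. Fail loudly otherwise.
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError(f"{path}: expected 16-bit mono PCM")
        raw = wav.readframes(wav.getnframes())
        sample_rate = wav.getframerate()
    # WAV stores samples little-endian; "<h" is a little-endian int16.
    samples = list(struct.unpack(f"<{len(raw) // 2}h", raw))
    return samples, sample_rate


def iter_examples(manifest_path, audio_dir):
    """Yield one {"speech", "text"} dict per row of the manifest.

    The manifest layout (columns "file_name" and "caption") is a
    hypothetical convention chosen for this sketch.
    """
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            wav_path = os.path.join(audio_dir, row["file_name"])
            samples, _ = read_wav_int16(wav_path)
            yield {"speech": samples, "text": row["caption"]}


def to_tf_dataset(manifest_path, audio_dir):
    """Wrap the generator above in a tf.data.Dataset (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    return tf.data.Dataset.from_generator(
        lambda: iter_examples(manifest_path, audio_dir),
        output_signature={
            # Variable-length 1-D waveform and a scalar caption string.
            "speech": tf.TensorSpec(shape=(None,), dtype=tf.int16),
            "text": tf.TensorSpec(shape=(), dtype=tf.string),
        },
    )
```

With files laid out that way, something like to_tf_dataset("metadata.csv", "wavs") should give a dataset whose elements have the audio in one field and the caption in the other. Because the clips differ in length, batching would probably need padded_batch rather than batch.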

8bitmp3 · October 7, 2021, 9:51pm

@Callum_Matthews Building a custom automatic speech recognition (ASR)/speech-to-text dataset is probably quite challenging. Would it help to look at the source code of some pre-made TensorFlow Datasets, such as Librispeech or speech_commands?

Dataset: librispeech | TensorFlow Datasets
Source code: datasets/librispeech.py at master · tensorflow/datasets · GitHub
Dataset: speech_commands | TensorFlow Datasets
Source code: datasets/speech_commands.py at master · tensorflow/datasets · GitHub