Creating dataset with audio and its captions

I have audio files (.wav) and their corresponding captions. I’m slowly researching and building a model to transcribe the audio into text. The audio files are mostly short, averaging about 12 seconds each. But how would I do that? Is there a way to create a custom TF dataset that holds the audio in one column and its captions in another?

I’d format my dataset like others that have a similar objective, such as librispeech (see the librispeech page in the TensorFlow Datasets catalog).

That will help you train a model later, as there are already many examples based on the librispeech dataset.
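As a rough illustration of that layout, here is a minimal sketch using only the Python standard library. It assumes a hypothetical `captions.csv` with `file_name` and `caption` columns (these names are made up for the example) and 16-bit mono PCM `.wav` files. Each example is a dict with `id`, `speech`, and `text` keys, which mirrors a subset of the features the librispeech dataset exposes.

```python
import array
import csv
import sys
import wave
from pathlib import Path


def read_wav_int16(path):
    """Return (samples, sample_rate) for a 16-bit mono PCM .wav file."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError(f"{path}: expected 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    return samples.tolist(), rate


def iter_examples(audio_dir, captions_csv):
    """Yield one dict per clip, pairing the audio with its caption.

    Assumes captions_csv has 'file_name' and 'caption' columns
    (hypothetical names chosen for this sketch).
    """
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            wav_path = Path(audio_dir) / row["file_name"]
            speech, _rate = read_wav_int16(wav_path)
            yield {
                "id": wav_path.stem,     # file name without extension
                "speech": speech,        # list of int16 sample values
                "text": row["caption"],  # the transcript
            }
```

A generator like this could then be wrapped with `tf.data.Dataset.from_generator`, passing an `output_signature` that describes the three fields, or used as the example generator of a custom TensorFlow Datasets builder.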


@Callum_Matthews Building a custom automatic speech recognition (ASR)/speech-to-text dataset is probably quite challenging. Would it help to look at the source code of some pre-made TensorFlow Datasets, such as Librispeech or speech_commands?